
AI Cluster Architect
Vultr
full-time
Location Type: Remote
Location: United States
Salary
💰 $165,000 - $185,000 per year
About the role
- Architect large-scale GPU clusters within fixed site power budgets, optimizing for maximum GPU density while reserving the necessary headroom for compute services, storage, and networking.
- Model and validate power consumption across the full cluster bill of materials (GPUs, CPUs, NICs, switches, fabric components, storage, and facility limits).
- Evaluate tradeoffs across multiple fabric networking architectures (InfiniBand, RoCE, SpectrumX) as well as multi-plane, 2-tier/3-tier, and rail-optimized topologies.
- Determine network scale limits based on switch radix, link speed, topology, and blocking requirements.
- Gather, interpret, and maintain detailed SKU-level power and thermal specifications for GPUs, NICs, switches, DPUs, storage, and server platforms.
- Develop power-aware cluster configuration templates and capacity-planning models that can scale across sites with varying constraints and allow for quick iteration and ideation.
- Document architecture, design choices, tradeoff analyses, and operational considerations for deployment and lifecycle management.
- Provide guidance on future-proofing, including the ability to incorporate next-gen GPUs, NICs, or fabrics.
- Collaborate with vendors on novel fabric architectures that enable large-scale cluster deployments (100k+ GPUs).
Requirements
- 7+ years designing or building large-scale HPC, AI, or hyperscale GPU clusters.
- Expert understanding of GPU and accelerator system design, including node topology, PCIe/NVLink/NVSwitch/ROCm, and NIC-to-GPU affinity considerations.
- Strong familiarity with InfiniBand, RoCE, and SpectrumX networking, including multi-tier, multi-plane, Clos/dragonfly variants, and large-radix switch design.
- Demonstrated experience modeling power draw and thermal characteristics of servers, GPUs, NICs, switches, optics, and storage systems.
- Ability to design networks that maintain full non-blocking performance or intentionally introduce over/under-subscription while understanding impacts on workload performance.
- Proven ability to gather and analyze vendor SKU-level specifications and incorporate them into scalable cluster architectures.
- Experience balancing customer-driven requirements for compute, storage, and service density in combination with overall GPU count.
- Strong documentation, communication, and cross-functional collaboration skills.
Benefits
- Excellent Medical Benefits w/ 100% company-paid premiums for the employee-only plan + 100% company-paid dental & vision premiums
- 401(k) plan that matches 100% up to 4% with immediate vesting
- Professional Development Reimbursement of $2,500 each year
- 11 Holidays + Paid Time Off Accrual + Rollover Plan + take your birthday off
- Commitment matters to Vultr! Increased PTO at 3-year & 10-year anniversaries + 1 month paid sabbatical every 5 years + Anniversary Bonus each year
- $500 first year remote office setup + $400 each following year for new equipment
- Internet reimbursement up to $75 per month
- Gym membership reimbursement up to $50 per month
- Company-paid Wellable subscription
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
GPU architecture, HPC design, AI cluster design, power modeling, thermal modeling, InfiniBand, RoCE, SpectrumX, PCIe, NVLink
Soft Skills
documentation, communication, cross-functional collaboration, analytical skills, problem-solving