Vultr

AI Cluster Architect

full-time

Location Type: Remote

Location: United States

Salary

💰 $165,000 - $185,000 per year

About the role

  • Architect large-scale GPU clusters within fixed site power budgets, optimizing for maximum GPU density while reserving the necessary headroom for compute services, storage, and networking.
  • Model and validate power consumption across the full cluster bill of materials (GPUs, CPUs, NICs, switches, fabric components, and storage) against facility limits.
  • Evaluate tradeoffs across multiple fabric networking architectures (InfiniBand, RoCE, Spectrum-X) as well as multi-plane, 2-tier/3-tier, and rail-optimized topologies.
  • Determine network scale limits based on switch radix, link speed, topology, and blocking requirements.
  • Gather, interpret, and maintain detailed SKU-level power and thermal specifications for GPUs, NICs, switches, DPUs, storage, and server platforms.
  • Develop power-aware cluster configuration templates and capacity-planning models that can scale across sites with varying constraints and allow for quick iteration and ideation.
  • Document architecture, design choices, tradeoff analyses, and operational considerations for deployment and lifecycle management.
  • Provide guidance on future-proofing, including the ability to incorporate next-gen GPUs, NICs, or fabrics.
  • Collaborate with vendors on novel fabric architectures that enable large-scale cluster deployments (100k+ GPUs).

Requirements

  • 7+ years designing or building large-scale HPC, AI, or hyperscale GPU clusters.
  • Expert understanding of GPU and accelerator system design, including node topology, PCIe/NVLink/NVSwitch/ROCm, and NIC-to-GPU affinity considerations.
  • Strong familiarity with InfiniBand, RoCE, and Spectrum-X networking, including multi-tier, multi-plane, Clos/dragonfly variants, and large-radix switch design.
  • Demonstrated experience modeling power draw and thermal characteristics of servers, GPUs, NICs, switches, optics, and storage systems.
  • Ability to design networks that either maintain full non-blocking performance or intentionally introduce oversubscription or undersubscription, with an understanding of the impact on workload performance.
  • Proven ability to gather and analyze vendor SKU-level specifications and incorporate them into scalable cluster architectures.
  • Experience balancing customer-driven requirements for compute, storage, and service density against overall GPU count.
  • Strong documentation, communication, and cross-functional collaboration skills.

Benefits

  • Excellent medical benefits with 100% company-paid premiums for the employee-only plan + 100% company-paid dental & vision premiums
  • 401(k) plan that matches 100% up to 4% with immediate vesting
  • Professional Development Reimbursement of $2,500 each year
  • 11 Holidays + Paid Time Off Accrual + Rollover Plan + take your birthday off
  • Commitment matters to Vultr! Increased PTO at 3-year & 10-year anniversaries + 1 month paid sabbatical every 5 years + Anniversary Bonus each year
  • $500 first year remote office setup + $400 each following year for new equipment
  • Internet reimbursement up to $75 per month
  • Gym membership reimbursement up to $50 per month
  • Company-paid Wellable subscription

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
GPU architecture, HPC design, AI cluster design, power modeling, thermal modeling, InfiniBand, RoCE, Spectrum-X, PCIe, NVLink
Soft Skills
documentation, communication, cross-functional collaboration, analytical skills, problem-solving