FluidStack

Director, Infrastructure

Full-time

Location Type: Remote

Location: United States

Salary

$250,000 - $350,000 per year

About the role

  • Own the technical design, deployment, and operational reliability of Fluidstack's bare-metal clusters across all production sites, covering compute, storage, and networking infrastructure.
  • Lead the Infrastructure Engineering organization, comprising Networking Engineers, Compute Systems Engineers, and Storage Engineers, with high standards for technical depth, deployment velocity, and on-call reliability.
  • Drive cluster architecture decisions for current-generation GPU systems (NVIDIA, AMD, and other XPUs), including server configuration, frontend and backend fabric design, storage topology, and rack power and cooling envelope.
  • Coordinate with Supply Chain on OEM relationships, hardware specifications, and delivery timelines to ensure the physical infrastructure roadmap stays one step ahead of customer commitments.
  • Partner with Data Center Operations on new site bring-ups, ensuring smooth handoff from civil and MEP completion through network cabling, hardware racking, burn-in, and customer acceptance testing.
  • Work with Software Engineering and SRE to define infrastructure requirements for managed Kubernetes, SLURM, and inference serving, ensuring the physical layer meets the demands of the software stack.
  • Build and maintain deployment tooling, burn-in automation, and hardware lifecycle management systems that enable your team to operate at a pace and reliability level that sets Fluidstack apart.
  • Stay hands-on: participate in design reviews, be present for critical cluster bring-ups, and engage directly with complex infrastructure failures to maintain technical credibility with your team and across the organization.
  • Travel as needed to data centers, OEM facilities, customer sites, and industry events to stay close to the hardware, the partners, and the market.
  • Coordinate with Finance on infrastructure CapEx planning and cost modeling, with Security on hardening and compliance requirements, and with Sales on pre-sales technical diligence and capacity commitments to customers.

Requirements

  • 10+ years of infrastructure engineering experience, with at least 3 years in a technical leadership role managing a team of systems, networking, or storage engineers.
  • Demonstrated ownership of the design, deployment, and operation of a 10,000+ GPU cluster using a recent-generation accelerator (Blackwell, Hopper, or equivalent XPU), from physical hardware bring-up through production steady-state.
  • On-site, hands-on experience physically deploying hardware in data centers, with a clear sense of what it takes to execute a fast, reliable cluster bring-up.
  • Deep expertise in high-performance networking for AI workloads: InfiniBand (XDR/NDR) or RoCEv2 fabric design, large-scale BGP and ECMP architectures, and switch and cable plant management.
  • Strong working knowledge of GPU server hardware internals: NVLink and PCIe topology, NVMe configurations, BMC and firmware management.
  • Experience with high-performance parallel and distributed storage systems for AI training workloads, such as DDN/Lustre, WekaFS, VAST, and open-source solutions.
  • Exceptional written and verbal communication skills, with the ability to translate between deep technical detail and high-level summaries for engineering, executive, and customer audiences.

Benefits

  • Competitive total compensation package (salary + equity).
  • Retirement or pension plan, in line with local norms.
  • Health, dental, and vision insurance.
  • Generous PTO policy, in line with local norms.