
HPC Solutions Architect
Lavendo
full-time
Posted on:
Location Type: Remote
Location: California • United States
Visit company websiteExplore more
Salary
💰 $225,000 - $315,000 per year
About the role
- Architect and implement HPC clusters for AI, simulation, and distributed training using Kubernetes and schedulers like Slurm.
- Integrate NVIDIA Hopper and Blackwell‑class GPUs with NVLink/NVSwitch and InfiniBand/RoCE.
- Deploy and manage GPU Operator and Network Operator for large fleets.
- Design and validate cloud‑native HPC environments with low latency and high bandwidth.
- Define and document reference architectures for AI model training and MLOps.
- Collaborate with NVIDIA and other partners to evaluate new GPU generations and software stacks.
- Benchmark performance, track down bottlenecks, and recommend concrete changes.
- Lead design sessions and architecture reviews with customers focused on performance and reliability.
Requirements
- A Bachelor’s or Master’s in Computer Science, Engineering, or a related field (PhD is a plus).
- 3+ years actually building or running HPC or large GPU clusters—on‑prem, cloud, or hybrid.
- Strong Linux background, plus Kubernetes and container runtimes (containerd, CRI‑O, Docker) in real environments, with CI/CD in the loop.
- A solid handle on HPC networking and RDMA: InfiniBand, RoCE, NVLink/NVSwitch.
- Experience with storage and I/O for big workloads: Ceph, Lustre, NFS at scale, GPUDirect Storage, or similar systems.
- Comfort with Terraform, Ansible, Helm, and GitOps‑style workflows.
- Good scripting skills in Python or Bash.
- You write and speak clearly, can lead a design review without losing the room, and can keep both engineers and non‑technical stakeholders on the same page.
- Legal authorization to work in the U.S. on a full-time basis without visa sponsorship.
Benefits
- 100% employer‑paid medical, dental, and vision for you and your family
- 4% 401(k) match with immediate vesting
- Company‑paid short‑ and long‑term disability and life insurance
- 20 weeks paid parental leave for primary caregivers, 12 weeks for secondary
- Support for your home office (mobile + internet stipend)
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
HPCKubernetesLinuxGPU clustersRDMAInfiniBandRoCEPythonBashCI/CD
Soft Skills
communicationleadershipcollaborationdesign reviewdocumentation