Lavendo

HPC Solutions Architect

full-time

Location Type: Remote

Location: California, United States


Salary

💰 $225,000 - $315,000 per year

About the role

  • Architect and implement HPC clusters for AI, simulation, and distributed training using Kubernetes and schedulers like Slurm.
  • Integrate NVIDIA Hopper and Blackwell‑class GPUs with NVLink/NVSwitch and InfiniBand/RoCE.
  • Deploy and manage GPU Operator and Network Operator for large fleets.
  • Design and validate cloud‑native HPC environments with low latency and high bandwidth.
  • Define and document reference architectures for AI model training and MLOps.
  • Collaborate with NVIDIA and other partners to evaluate new GPU generations and software stacks.
  • Benchmark performance, track down bottlenecks, and recommend concrete changes.
  • Lead design sessions and architecture reviews with customers focused on performance and reliability.

Requirements

  • A Bachelor’s or Master’s in Computer Science, Engineering, or a related field (PhD is a plus).
  • 3+ years of hands‑on experience building or running HPC or large GPU clusters (on‑prem, cloud, or hybrid).
  • Strong Linux background, plus Kubernetes and container runtimes (containerd, CRI‑O, Docker) in real environments, with CI/CD in the loop.
  • A solid handle on HPC networking and RDMA: InfiniBand, RoCE, NVLink/NVSwitch.
  • Experience with storage and I/O for big workloads: Ceph, Lustre, NFS at scale, GPUDirect Storage, or similar systems.
  • Comfort with Terraform, Ansible, Helm, and GitOps‑style workflows.
  • Good scripting skills in Python or Bash.
  • You write and speak clearly, can lead a design review without losing the room, and can keep both engineers and non‑technical stakeholders on the same page.
  • Legal authorization to work in the U.S. on a full-time basis without visa sponsorship.

Benefits

  • 100% employer‑paid medical, dental, and vision for you and your family
  • 4% 401(k) match with immediate vesting
  • Company‑paid short‑ and long‑term disability and life insurance
  • 20 weeks paid parental leave for primary caregivers, 12 weeks for secondary
  • Support for your home office (mobile + internet stipend)

Applicant Tracking System Keywords

Hard Skills & Tools
HPC, Kubernetes, Linux, GPU clusters, RDMA, InfiniBand, RoCE, Python, Bash, CI/CD
Soft Skills
communication, leadership, collaboration, design review, documentation