Lambda

HPC Operations Engineer

HPC Operations Engineer at Lambda, deploying large-scale clusters for AI workloads, with a focus on operational efficiency, mentoring, and keeping current with the latest HPC/AI technologies.

Posted 6/6/2026 • Full-time • San Francisco, California, 🇺🇸 United States • Mid-Level, Senior • 💰 $240,000 - $356,000 per year

Tech Stack

Tools & technologies
  • Kubernetes
  • Linux
  • Switching

About the role

Key responsibilities & impact
  • Remotely deploy and configure large-scale HPC clusters for AI workloads (up to many thousands of nodes)
  • Remotely install and configure operating systems, firmware, software, and networking on HPC clusters, both manually and with automation tools
  • Troubleshoot and resolve HPC cluster issues, working closely with on-site physical deployment teams
  • Give other engineering teams clear, detailed requirements on gaps and areas for improvement, specifically around simplification, stability, and operational efficiency
  • Contribute to the creation and maintenance of Standard Operating Procedures
  • Provide regular and well-communicated updates to project leads throughout each deployment
  • Mentor and assist less experienced team members
  • Stay up-to-date on the latest HPC/AI technologies and best practices

Requirements

What you’ll need
  • 5+ years of experience in deploying and configuring HPC clusters for AI workloads
  • Strong understanding of HPC/AI architecture, operating systems, firmware, software, and networking
  • Expertise in configuring and troubleshooting:
      • SFP+ fiber, InfiniBand (IB), and 100 GbE network fabrics
      • Ethernet, switching, power infrastructure, GPUDirect, RDMA, NCCL, and Horovod environments
      • Linux-based compute nodes, firmware updates, and driver installation
      • SLURM, Kubernetes, or other job scheduling systems
  • Excellent problem-solving and troubleshooting skills
  • Flexibility to travel to North American data centers as on-site needs arise
  • Ability to work independently and as part of a team
  • Comfortable mentoring and supporting junior HPC engineers on cluster deployments

Benefits

Comp & perks
  • Health, dental, and vision coverage for you and your dependents
  • Wellness and commuter stipends for select roles
  • 401(k) plan with 2% company match (USA employees)
  • Flexible paid time off plan that we all actually use

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
HPC clusters, AI workloads, operating systems, firmware, networking, SFP+ fiber, InfiniBand, 100 GbE network fabrics, Linux, SLURM
Soft Skills
problem solving, troubleshooting, mentoring, communication, independence, teamwork, flexibility