HPC Operations Engineer

Lambda

HPC Operations Engineer at Lambda, deploying large-scale clusters for AI workloads. The role focuses on operational efficiency, mentoring, and staying current with the latest HPC/AI technologies.

Posted 6/6/2026 • Full-time • San Francisco • California • United States • Mid-Level, Senior • $240,000 - $356,000 per year

ATS Keywords

Hard Skills

HPC clusters, AI workloads, operating systems, firmware, networking, SFP+ fiber, InfiniBand, 100 GbE network fabrics, Linux, SLURM

Soft Skills

problem solving, troubleshooting, mentoring, communication, independence, teamwork, flexibility

Tools & Technologies

automation tools, Kubernetes, NCCL, Horovod, RDMA, GPUDirect, power infrastructure, switching

Industry Keywords

HPC/AI architecture, Standard Operating Procedures, operational efficiency, deployment, data centers

Tech Stack

Tools & technologies

Kubernetes, Linux, Switching

About the role

Key responsibilities & impact

Remotely deploy and configure large-scale HPC clusters for AI workloads (up to many thousands of nodes)
Remotely install and configure operating systems, firmware, software, and networking on HPC clusters both manually and using automation tools
Troubleshoot and resolve HPC cluster issues, working closely with on-site physical deployment teams
Provide clear and detailed requirements back to other engineering teams on gaps and improvement areas, specifically in the areas of simplification, stability, and operational efficiency
Contribute to the creation and maintenance of Standard Operating Procedures
Provide regular and well-communicated updates to project leads throughout each deployment
Mentor and assist less experienced team members
Stay up-to-date on the latest HPC/AI technologies and best practices

Requirements

What you’ll need

5+ years of experience in deploying and configuring HPC clusters for AI workloads
Strong understanding of HPC/AI architecture, operating systems, firmware, software, and networking
Expertise in configuring and troubleshooting: SFP+ fiber, InfiniBand (IB), and 100 GbE network fabrics; Ethernet, switching, power infrastructure, GPUDirect, RDMA, NCCL, and Horovod environments; Linux-based compute nodes, firmware updates, and driver installation; SLURM, Kubernetes, or other job scheduling systems
Excellent problem-solving and troubleshooting skills
Flexibility to travel to North American data centers as on-site needs arise
Ability to work independently and as part of a team
Comfortable mentoring and supporting junior HPC engineers on cluster deployments

Benefits

Comp & perks

Health, dental, and vision coverage for you and your dependents
Wellness and commuter stipends for select roles
401(k) plan with 2% company match (USA employees)
Flexible paid time off plan that we all actually use