HPC Operations Engineer
Lambda
HPC Operations Engineer at Lambda, deploying large-scale clusters for AI workloads. The role focuses on operational efficiency, mentoring, and the latest HPC/AI technologies.
Posted 6/6/2026
full-time
San Francisco • California • 🇺🇸 United States
Mid-Level, Senior
💰 $240,000 - $356,000 per year
Tech Stack
Tools & technologies: Kubernetes, Linux, Switching
About the role
Key responsibilities & impact
- Remotely deploy and configure large-scale HPC clusters for AI workloads (up to many thousands of nodes)
- Remotely install and configure operating systems, firmware, software, and networking on HPC clusters both manually and using automation tools
- Troubleshoot and resolve HPC cluster issues working closely with physical deployment teams on-site
- Provide clear and detailed requirements back to other engineering teams on gaps and improvement areas, specifically in the areas of simplification, stability, and operational efficiency
- Contribute to the creation and maintenance of Standard Operating Procedures
- Provide regular and well-communicated updates to project leads throughout each deployment
- Mentor and assist less experienced team members
- Stay up-to-date on the latest HPC/AI technologies and best practices
Requirements
What you’ll need
- 5+ years of experience in deploying and configuring HPC clusters for AI workloads
- Strong understanding of HPC/AI architecture, operating systems, firmware, software, and networking
- Expertise in configuring and troubleshooting:
  - SFP+ fiber, InfiniBand (IB), and 100 GbE network fabrics
  - Ethernet, switching, power infrastructure, GPUDirect, RDMA, NCCL, and Horovod environments
  - Linux-based compute nodes, firmware updates, and driver installation
  - SLURM, Kubernetes, or other job scheduling systems
- Excellent problem solving and troubleshooting skills
- Flexibility to travel to North American data centers as on-site needs arise
- Ability to work independently and as part of a team
- Comfortable mentoring and supporting junior HPC engineers on cluster deployments
Benefits
Comp & perks
- Health, dental, and vision coverage for you and your dependents
- Wellness and commuter stipends for select roles
- 401(k) plan with 2% company match (USA employees)
- Flexible paid time off plan that we all actually use
ATS Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
HPC clusters, AI workloads, operating systems, firmware, networking, SFP+ fiber, InfiniBand, 100 GbE network fabrics, Linux, SLURM
Soft Skills
problem solving, troubleshooting, mentoring, communication, independence, teamwork, flexibility