Salary
💰 $160,000 - $206,000 per year
Tech Stack
Cloud, Grafana, Kubernetes, Linux, Prometheus
About the role
- Act as the primary technical point of escalation for Super Intelligence customers running hyperscale GPU clusters.
- Lead incident response for complex issues, ensuring rapid triage, clear communication, and timely resolution.
- Proactively identify risks in large environments (firmware, performance bottlenecks, orchestration issues) and drive preventative improvements.
- Partner closely with Lambda Engineering and Product teams to influence roadmap decisions based on real customer needs.
- Contribute to runbooks, best practices, and operational guides tailored for hyperscale environments.
- Train and mentor other support engineers, raising the bar across Lambda’s support organization.
- Participate in a rotating on-call schedule, owning critical incidents and high-priority alerts for SI customers.
Requirements
- 7+ years of experience in HPC or cloud support engineering, with customer-facing responsibilities.
- Proven experience managing large-scale Linux clusters and distributed HPC/AI workloads.
- Deep expertise in orchestration tools such as Kubernetes and/or Slurm.
- Strong knowledge of GPU technologies (CUDA, NCCL, MIG, NVLink, GPUDirect RDMA).
- Skilled in high-throughput networking (InfiniBand, RoCE) and cluster storage solutions.
- Familiarity with monitoring/logging platforms (Prometheus, Grafana, Datadog).
- Experience leading incident management and communicating directly with enterprise or hyperscale customers.
- Ability to balance deep technical troubleshooting with clear, concise communication to executives and stakeholders.
Benefits
- Health, dental, and vision coverage for you and your dependents
- Wellness and Commuter stipends for select roles
- 401(k) plan with 2% company match (USA employees)
- Flexible Paid Time Off Plan that we all actually use
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
HPC, cloud support engineering, Linux clusters, distributed HPC workloads, orchestration tools, Kubernetes, Slurm, GPU technologies, CUDA, high-throughput networking
Soft skills
incident management, clear communication, mentoring, problem-solving, risk identification, customer-facing, team collaboration, training, leadership, concise communication