
Senior Site Reliability Engineer – AI Infrastructure
Andromeda
Full-time
Location Type: Remote
Location: California • United States
About the role
- Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training
- Serve as the primary technical point of contact for customers running large-scale training workloads
- Define SLOs and error budgets that account for the unique failure modes of GPU infrastructure
- Ensure the health and performance of high-speed interconnects
- Build deep visibility into GPU utilization, memory pressure, and interconnect throughput
- Build production-grade automation for cluster provisioning, GPU health checks, and job scheduling
- Lead incident response for complex failures spanning hardware, networking, and orchestration
Requirements
- Deep, hands-on experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent)
- Production experience with InfiniBand, RoCE, or NVLink fabrics in the context of distributed training
- Working knowledge of NCCL, CUDA, PyTorch distributed, DeepSpeed, Megatron, FSDP, or similar
- Expert-level Linux knowledge
- Strong experience running Kubernetes in production with GPU workloads
- Strong engineering skills in Python, Go, or Bash
- Hands-on experience building monitoring and alerting for GPU infrastructure
- Proven track record leading incident response for complex distributed systems
Benefits
- Health insurance
- Retirement plans
- Paid time off
- Flexible work arrangements
- Professional development
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
GPU clusters, NVIDIA A100, InfiniBand, RoCE, NVLink, NCCL, CUDA, PyTorch, Kubernetes, Python
Soft Skills
leadership, incident response, communication