Senior Site Reliability Engineer – AI Infrastructure

Andromeda

full-time

Posted on: 4/9/2026

Location Type: Remote

✨ AI Apply

About the role

Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training
Serve as the primary technical point of contact for customers running large-scale training workloads
Define SLOs and error budgets that account for the unique failure modes of GPU infrastructure
Ensure the health and performance of high-speed interconnects
Build deep visibility into GPU utilization, memory pressure, interconnect throughput
Build production-grade automation for cluster provisioning, GPU health checks, job scheduling
Lead incident response for complex failures spanning hardware, networking, orchestration

Deep, hands-on experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent)
Production experience with InfiniBand, RoCE, or NVLink fabrics in the context of distributed training
Working knowledge of NCCL, CUDA, PyTorch distributed, DeepSpeed, Megatron, FSDP, or similar
Expert-level Linux knowledge
Strong experience running Kubernetes in production with GPU workloads
Strong engineering skills in Python, Go, or Bash
Hands-on experience building monitoring and alerting for GPU infrastructure
Proven track record leading incident response for complex distributed systems

Benefits

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

GPU clustersNVIDIA A100InfiniBandRoCENVLinkNCCLCUDAPyTorchKubernetesPython

Soft Skills

leadershipincident responsecommunication