Andromeda

Senior Site Reliability Engineer – AI Infrastructure

Full-time

Location Type: Remote

Location: California, United States

About the role

  • Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training
  • Serve as the primary technical point of contact for customers running large-scale training workloads
  • Define SLOs and error budgets that account for the unique failure modes of GPU infrastructure
  • Ensure the health and performance of high-speed interconnects
  • Build deep visibility into GPU utilization, memory pressure, and interconnect throughput
  • Build production-grade automation for cluster provisioning, GPU health checks, and job scheduling
  • Lead incident response for complex failures spanning hardware, networking, and orchestration

Requirements

  • Deep, hands-on experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent)
  • Production experience with InfiniBand, RoCE, or NVLink fabrics in the context of distributed training
  • Working knowledge of NCCL, CUDA, PyTorch distributed, DeepSpeed, Megatron, FSDP, or similar
  • Expert-level Linux knowledge
  • Strong experience running Kubernetes in production with GPU workloads
  • Strong engineering skills in Python, Go, or Bash
  • Hands-on experience building monitoring and alerting for GPU infrastructure
  • Proven track record leading incident response for complex distributed systems

Benefits

  • Health insurance
  • Retirement plans
  • Paid time off
  • Flexible work arrangements
  • Professional development