Staff SRE, AI Infrastructure

Andromeda

Staff SRE at Andromeda, responsible for the reliability of AI infrastructure: leading incident response and collaborating with engineering on solutions.

Posted 5/21/2026 • Full-time • Remote • California • 🇺🇸 United States • Lead

Tech Stack

Tools & technologies

Go, Linux, Python, PyTorch, Rust

About the role

Key responsibilities & impact

Own the reliability of Andromeda's infrastructure end to end
Lead incident response for top customers' training runs and write the postmortems
Ensure the health of thousands of GPUs across providers
Build telemetry, GPU health checks, and automated remediation
Define on-call processes, such as rotations and escalation paths
Be the reliability voice in customer incident reviews
Collaborate closely with the product team on SLOs
Partner with providers and data center teams on physical design
Make other engineers better through mentorship

Requirements

What you’ll need

Multiple years building and operating large-scale GPU infrastructure as your primary job
A clear history of owning the reliability of load-bearing infrastructure
Deep, hands-on experience with NVIDIA H100/H200/B200/GB200 (or equivalent) at scale
Real production experience with InfiniBand, RoCE, and NVLink fabrics
Working knowledge of how large training jobs run — NCCL, CUDA, PyTorch distributed
Strong Go, Python, or Rust proficiency
Expert-level knowledge of Linux and systems internals
Comfortable being the senior engineer on a P0 bridge with the customer
Comfortable being the senior technical voice with AI infra customers

Benefits

Comp & perks

Significant autonomy
Working on infrastructure that the most ambitious AI labs depend on

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

GPU infrastructure, NVIDIA H100, NVIDIA H200, NVIDIA B200, NVIDIA GB200, InfiniBand, RoCE, NVLink, NCCL, CUDA

Soft Skills

mentorship, collaboration, leadership, communication, incident review