Andromeda

Staff SRE, AI Infrastructure

The Staff SRE at Andromeda is responsible for the reliability of the company's AI infrastructure, leading incident response and collaborating with engineering on solutions.

Posted 5/21/2026 • Full-time • Remote • California • 🇺🇸 United States • Lead

Tech Stack

Tools & technologies
Go, Linux, Python, PyTorch, Rust

About the role

Key responsibilities & impact
  • Own the reliability of Andromeda's infrastructure end to end
  • Lead incident response for top customers' training runs and write the postmortems
  • Ensure the health of thousands of GPUs across providers
  • Build telemetry, GPU health checks, and automated remediation
  • Define on-call processes such as rotations and escalation
  • Be the reliability voice in customer incident reviews
  • Collaborate closely with the product team on SLOs
  • Partner with providers and data center teams on physical design
  • Make other engineers better through mentorship

Requirements

What you’ll need
  • Multiple years building and operating large-scale GPU infrastructure as your primary job
  • A clear history of owning the reliability of load-bearing infrastructure
  • Deep, hands-on experience with NVIDIA H100/H200/B200/GB200 (or equivalent) at scale
  • Real production experience with InfiniBand, RoCE, and NVLink fabrics
  • Working knowledge of how large training jobs run — NCCL, CUDA, PyTorch distributed
  • Strong Go, Python, or Rust proficiency
  • Expert-level knowledge of Linux and systems internals
  • Comfortable being the senior engineer on a P0 bridge with the customer
  • Comfortable being the senior technical voice with AI infra customers

Benefits

Comp & perks
  • Significant autonomy
  • Working on infrastructure that the most ambitious AI labs depend on

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
GPU infrastructure, NVIDIA H100, NVIDIA H200, NVIDIA B200, NVIDIA GB200, InfiniBand, RoCE, NVLink, NCCL, CUDA
Soft Skills
mentorship, collaboration, leadership, communication, incident review