Tech Stack
Tools & technologies
Go, Linux, Python, PyTorch, Rust
About the role
Key responsibilities & impact
- Own the reliability of Andromeda's infrastructure end to end
- Lead incident response for top customers' training runs and write the postmortems
- Ensure the health of thousands of GPUs across providers
- Build telemetry, GPU health checks, and automated remediation
- Define on-call processes like rotations and escalation
- Be the reliability voice in customer incident reviews
- Collaborate closely with the product team on SLOs
- Partner with providers and data center teams on physical design
- Make other engineers better through mentorship
Requirements
What you’ll need
- Multiple years building and operating large-scale GPU infrastructure as your primary job
- A clear history of owning the reliability of load-bearing infrastructure
- Deep, hands-on experience with NVIDIA H100/H200/B200/GB200 (or equivalent) at scale
- Real production experience with InfiniBand, RoCE, and NVLink fabrics
- Working knowledge of how large training jobs run — NCCL, CUDA, PyTorch distributed
- Strong Go, Python, or Rust proficiency
- Expert-level knowledge of Linux and systems internals
- Comfortable being the senior engineer on a P0 bridge with the customer
- Comfortable being the senior technical voice with AI infra customers
Benefits
Comp & perks
- Significant autonomy
- Working on infrastructure that the most ambitious AI labs depend on
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
GPU infrastructure, NVIDIA H100, NVIDIA H200, NVIDIA B200, NVIDIA GB200, InfiniBand, RoCE, NVLink, NCCL, CUDA
Soft Skills
mentorship, collaboration, leadership, communication, incident review
