Site Reliability Engineer

RunPod

The Site Reliability Engineer ensures the stability and resilience of Runpod's AI systems platform, collaborating with engineering teams to improve observability and prevent incidents.

Posted 4/21/2026 • Full-time • Remote • 🇺🇸 United States • Mid-Level, Senior • 💰 $150,000 - $200,000 per year

Tech Stack

Tools & technologies

Distributed Systems, Go, Grafana, Linux, Prometheus, Python

About the role

Key responsibilities & impact

Increase platform uptime and reduce incident frequency and duration
Establish and operationalize SLIs/SLOs across services
Improve MTTR through better tooling, automation, and runbooks
Strengthen production readiness standards
Drive long-term systemic reliability improvements
Define and implement SLIs/SLOs for critical services
Lead incident response and coordinate cross-team mitigation efforts
Conduct blameless postmortems and ensure corrective actions are completed
Perform production readiness reviews for new services and features
Identify systemic risks and drive preventative improvements
Design and improve monitoring, alerting, and dashboards (Prometheus, Grafana, etc.)
Improve signal-to-noise ratio in alerts and reduce alert fatigue
Build internal tooling for reliability tracking and reporting
Improve visibility into GPU performance and distributed systems health
Automate recurring operational workflows
Build tools and scripts (Python, Go, Bash) to eliminate manual processes
Improve deployment safety through automation and guardrails
Strengthen CI/CD reliability and release processes
Partner with engineering teams to improve system resilience
Provide guidance on fault tolerance, scalability, and failure handling
Contribute to architectural discussions with a reliability-first mindset

Requirements

What you’ll need

5+ years of experience in SRE, Reliability Engineering, or Production Engineering
Strong Linux systems and networking expertise
Experience managing containerized production systems
Strong understanding of distributed systems and failure modes
Experience defining and managing SLIs/SLOs
Proven incident response and postmortem leadership experience
Strong scripting or programming skills
Experience with monitoring and alerting systems
Excellent written communication skills
Successful completion of a background check

Benefits

Comp & perks

Meaningful equity in a fast-growing company
Generous medical, dental & vision plans
Flexible PTO: take the time you need to recharge
Most roles are remote-first, with inclusive, collaborative teams that use Slack as the main form of internal communication
Join a passionate team on the cutting edge of AI infrastructure — where culture, learning, and ownership are at the heart of how we scale.

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

SLIs, SLOs, MTTR, monitoring, alerting, Python, Go, Bash, CI/CD, containerized systems

Soft Skills

incident response, postmortem leadership, communication, collaboration, problem-solving, systemic risk identification, preventative improvements, guidance, architectural discussions, reliability mindset