RunPod

Site Reliability Engineer

The Site Reliability Engineer ensures the stability and resilience of Runpod's AI systems platform, collaborating with engineering teams to improve observability and prevent incidents.

Posted 4/21/2026 • Full-time • Remote • 🇺🇸 United States • Mid-Level, Senior • 💰 $150,000 - $200,000 per year

Tech Stack

Tools & technologies
  • Distributed Systems
  • Go
  • Grafana
  • Linux
  • Prometheus
  • Python

About the role

Key responsibilities & impact
  • Increase platform uptime and reduce incident frequency and duration
  • Establish and operationalize SLIs/SLOs across services
  • Improve MTTR through better tooling, automation, and runbooks
  • Strengthen production readiness standards
  • Drive long-term systemic reliability improvements
  • Define and implement SLIs/SLOs for critical services
  • Lead incident response and coordinate cross-team mitigation efforts
  • Conduct blameless postmortems and ensure corrective actions are completed
  • Perform production readiness reviews for new services and features
  • Identify systemic risks and drive preventative improvements
  • Design and improve monitoring, alerting, and dashboards (Prometheus, Grafana, etc.)
  • Improve signal-to-noise ratio in alerts and reduce alert fatigue
  • Build internal tooling for reliability tracking and reporting
  • Improve visibility into GPU performance and distributed systems health
  • Automate recurring operational workflows
  • Build tools and scripts (Python, Go, Bash) to eliminate manual processes
  • Improve deployment safety through automation and guardrails
  • Strengthen CI/CD reliability and release processes
  • Partner with engineering teams to improve system resilience
  • Provide guidance on fault tolerance, scalability, and failure handling
  • Contribute to architectural discussions with a reliability-first mindset

Requirements

What you’ll need
  • 5+ years of experience in SRE, Reliability Engineering, or Production Engineering
  • Strong Linux systems and networking expertise
  • Experience managing containerized production systems
  • Strong understanding of distributed systems and failure modes
  • Experience defining and managing SLIs/SLOs
  • Proven incident response and postmortem leadership experience
  • Strong scripting or programming skills
  • Experience with monitoring and alerting systems
  • Excellent written communication skills
  • Successful completion of a background check

Benefits

Comp & perks
  • Meaningful equity in a fast-growing company
  • Generous medical, dental & vision plans
  • Flexible PTO: take the time you need to recharge
  • Most roles are remote-first, with inclusive, collaborative teams that use Slack as the main form of internal communication
  • Join a passionate team on the cutting edge of AI infrastructure — where culture, learning, and ownership are at the heart of how we scale.

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
  • SLIs
  • SLOs
  • MTTR
  • Monitoring
  • Alerting
  • Python
  • Go
  • Bash
  • CI/CD
  • Containerized systems
Soft Skills
  • Incident response
  • Postmortem leadership
  • Communication
  • Collaboration
  • Problem-solving
  • Systemic risk identification
  • Preventative improvements
  • Guidance
  • Architectural discussions
  • Reliability mindset