RunPod

Site Reliability Engineer

Increase platform uptime and reduce incident frequency and duration.

Posted 4/21/2026 • Full-time • Remote • 🇺🇸 United States • Mid-Level, Senior • 💰 $150,000 - $200,000 per year

Tech Stack

Tools & technologies
Distributed Systems, Go, Grafana, Linux, Prometheus, Python

About the role

Key responsibilities & impact
  • Increase platform uptime and reduce incident frequency and duration
  • Establish and operationalize SLIs/SLOs across services
  • Improve MTTR through better tooling, automation, and runbooks
  • Strengthen production readiness standards
  • Drive long-term systemic reliability improvements
  • Define and implement SLIs/SLOs for critical services
  • Lead incident response and coordinate cross-team mitigation efforts
  • Conduct blameless postmortems and ensure corrective actions are completed
  • Perform production readiness reviews for new services and features
  • Identify systemic risks and drive preventative improvements
  • Design and improve monitoring, alerting, and dashboards (Prometheus, Grafana, etc.)
  • Improve signal-to-noise ratio in alerts and reduce alert fatigue
  • Build internal tooling for reliability tracking and reporting
  • Improve visibility into GPU performance and distributed systems health
  • Automate recurring operational workflows
  • Build tools and scripts (Python, Go, Bash) to eliminate manual processes
  • Improve deployment safety through automation and guardrails
  • Strengthen CI/CD reliability and release processes
  • Partner with engineering teams to improve system resilience
  • Provide guidance on fault tolerance, scalability, and failure handling
  • Contribute to architectural discussions with a reliability-first mindset

Requirements

What you’ll need
  • 5+ years of experience in SRE, Reliability Engineering, or Production Engineering
  • Strong Linux systems and networking expertise
  • Experience managing containerized production systems
  • Strong understanding of distributed systems and failure modes
  • Experience defining and managing SLIs/SLOs
  • Proven incident response and postmortem leadership experience
  • Strong scripting or programming skills
  • Experience with monitoring and alerting systems
  • Excellent written communication skills
  • Successful completion of a background check

Benefits

Comp & perks
  • Meaningful equity in a fast-growing company
  • Generous medical, dental & vision plans
  • Flexible PTO: take the time you need to recharge
  • Most roles are remote-first, with inclusive, collaborative teams that use Slack as the main form of internal communication
  • Join a passionate team on the cutting edge of AI infrastructure — where culture, learning, and ownership are at the heart of how we scale.

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
SLIs, SLOs, MTTR, monitoring, alerting, Python, Go, Bash, CI/CD, containerized systems
Soft Skills
incident response, postmortem leadership, communication, collaboration, problem-solving, systemic risk identification, preventative improvements, guidance, architectural discussions, reliability mindset