Tech Stack
Tools & technologies: Distributed Systems, Go, Grafana, Linux, Prometheus, Python
About the role
Key responsibilities & impact
- Increase platform uptime and reduce incident frequency and duration
- Establish and operationalize SLIs/SLOs across services
- Improve MTTR through better tooling, automation, and runbooks
- Strengthen production readiness standards
- Drive long-term systemic reliability improvements
- Define and implement SLIs/SLOs for critical services
- Lead incident response and coordinate cross-team mitigation efforts
- Conduct blameless postmortems and ensure corrective actions are completed
- Perform production readiness reviews for new services and features
- Identify systemic risks and drive preventative improvements
- Design and improve monitoring, alerting, and dashboards (Prometheus, Grafana, etc.)
- Improve signal-to-noise ratio in alerts and reduce alert fatigue
- Build internal tooling for reliability tracking and reporting
- Improve visibility into GPU performance and distributed systems health
- Automate recurring operational workflows
- Build tools and scripts (Python, Go, Bash) to eliminate manual processes
- Improve deployment safety through automation and guardrails
- Strengthen CI/CD reliability and release processes
- Partner with engineering teams to improve system resilience
- Provide guidance on fault tolerance, scalability, and failure handling
- Contribute to architectural discussions with a reliability-first mindset
Requirements
What you’ll need
- 5+ years of experience in SRE, Reliability Engineering, or Production Engineering
- Strong Linux systems and networking expertise
- Experience managing containerized production systems
- Strong understanding of distributed systems and failure modes
- Experience defining and managing SLIs/SLOs
- Proven incident response and postmortem leadership experience
- Strong scripting or programming skills
- Experience with monitoring and alerting systems
- Excellent written communication skills
- Successful completion of a background check
Benefits
Comp & perks
- Meaningful equity in a fast-growing company
- Generous medical, dental & vision plans
- Flexible PTO: take the time you need to recharge
- Most roles are remote-first, with inclusive, collaborative teams that use Slack as the main form of internal communication
- Join a passionate team on the cutting edge of AI infrastructure — where culture, learning, and ownership are at the heart of how we scale.
ATS Keywords
Hard Skills & Tools
SLIs, SLOs, MTTR, monitoring, alerting, Python, Go, Bash, CI/CD, containerized systems
Soft Skills
incident response, postmortem leadership, communication, collaboration, problem-solving, systemic risk identification, preventative improvements, guidance, architectural discussions, reliability mindset
