Lead the design and execution of an engineering-wide reliability program, ensuring teams adopt SRE principles and best practices
Define and champion service ownership standards, partnering with product and platform teams to embed reliability into the development lifecycle
Establish and evolve observability practices (metrics, logs, traces), ensuring teams have the tooling and insights to detect, debug, and prevent incidents
Partner with engineering leaders to define SLIs, SLOs, and error budgets tied to business outcomes
Collaborate with teams to design systems for resilience, scalability, and fault tolerance
Provide mentorship and guidance to engineers across the organization
Identify opportunities to add automation that increases developer productivity and reduces toil
Create standards, frameworks, and runbooks that scale reliability practices across multiple product lines and teams
Participate in and improve incident response practices (on-call strategy, SEVs, postmortems, blameless culture)
Report on progress, trends, and impact of the reliability program to leaders and stakeholders
Requirements
7+ years of experience in Site Reliability Engineering, Systems Engineering, or a related field, including at least 2 years in a senior or staff-level role
Strong software engineering skills in one or more languages (e.g., Python, Go, Java)
Expertise with cloud infrastructure (AWS preferred) and distributed systems at scale
Deep understanding of observability practices (metrics, logs, tracing) and hands-on experience with tools like Datadog, Prometheus, Grafana, or equivalent
Strong background in adding automation to increase developer productivity and reduce toil
Proven experience defining and rolling out SLIs, SLOs, and error budgets across engineering teams
Strong background in incident response, postmortems, and on-call operations
Demonstrated ability to influence and mentor engineers across multiple teams
Excellent communication skills, with the ability to convey technical concepts and reliability trade-offs to engineers, leadership, and stakeholders
Nice to have: Experience with Kubernetes and container orchestration
Nice to have: Familiarity with infrastructure-as-code tools (Terraform, CloudFormation, or similar)
Nice to have: Knowledge of CI/CD systems and modern release engineering practices
Nice to have: Prior experience building or leading an organization-wide reliability program
Nice to have: Familiarity with security and compliance considerations for large-scale platforms
Benefits
Comprehensive benefits package
Health and wellness programs
Paid time off
Retirement planning options
Potential equity for qualifying positions
Remote work (Remote - CO)
Applicant Tracking System Keywords
Hard skills
Site Reliability Engineering, Systems Engineering, Python, Go, Java, cloud infrastructure, distributed systems, observability practices, automation, incident response