Branch

Staff Site Reliability Engineer

full-time

Location Type: Remote

Location: Remote • Colorado • 🇺🇸 United States


Job Level

Lead

Tech Stack

AWS, Cloud, Distributed Systems, Go, Grafana, Java, Kubernetes, Prometheus, Python, Terraform

About the role

  • Lead the design and execution of an engineering-wide reliability program, ensuring teams adopt SRE principles and best practices
  • Define and champion service ownership standards, partnering with product and platform teams to embed reliability into the development lifecycle
  • Establish and evolve observability practices (metrics, logs, traces), ensuring teams have the tooling and insights to detect, debug, and prevent incidents
  • Partner with engineering leaders to define SLIs, SLOs, and error budgets tied to business outcomes
  • Collaborate with teams to design systems for resilience, scalability, and fault tolerance
  • Provide mentorship and guidance to engineers across the organization
  • Identify opportunities to add automation that increases developer productivity and reduces toil
  • Create standards, frameworks, and runbooks that scale reliability practices across multiple product lines and teams
  • Participate in and improve incident response practices (on-call strategy, SEVs, postmortems, blameless culture)
  • Report on progress, trends, and impact of the reliability program to leaders and stakeholders

Requirements

  • 7+ years of experience in Site Reliability Engineering, Systems Engineering, or related fields (at least 2–3 years in a senior/staff-level role)
  • Strong software engineering skills in one or more languages (e.g., Python, Go, Java)
  • Expertise with cloud infrastructure (AWS preferred) and distributed systems at scale
  • Deep understanding of observability practices (metrics, logs, tracing) and hands-on experience with tools like Datadog, Prometheus, Grafana, or equivalent
  • Strong background in adding automation to increase developer productivity and reduce toil
  • Proven experience defining and rolling out SLIs, SLOs, and error budgets across engineering teams
  • Strong background in incident response, postmortems, and on-call operations
  • Demonstrated ability to influence and mentor engineers across multiple teams
  • Excellent communication skills, with the ability to convey technical concepts and reliability trade-offs to engineers, leadership, and stakeholders
  • Nice to have: Experience with Kubernetes and container orchestration
  • Nice to have: Familiarity with infrastructure-as-code tools (Terraform, CloudFormation, or similar)
  • Nice to have: Knowledge of CI/CD systems and modern release engineering practices
  • Nice to have: Prior experience building or leading an organization-wide reliability program
  • Nice to have: Familiarity with security and compliance considerations for large-scale platforms

Benefits

  • Comprehensive benefits package
  • Health and wellness programs
  • Paid time off
  • Retirement planning options
  • Potential equity for qualifying positions
  • Remote work (Remote - CO)
