
Site Reliability Engineer
Heidi Health
Full-time
Location Type: Hybrid
Location: San Francisco • California • United States
Salary
💰 $140,000 - $185,000 per year
About the role
- Participate in on-call and incident response: Respond to production incidents, contribute to service restoration, and support clear communication during incidents.
- Improve operational reliability: Identify recurring issues and reliability risks, and drive fixes through better alerting, automation, system changes, or process improvements.
- Own parts of the production environment: Operate and improve Kubernetes clusters, cloud infrastructure, and core platform services.
- Strengthen observability: Improve dashboards, alerts, logs, and traces so issues are detected earlier.
- Reduce operational toil: Automate repetitive tasks, simplify runbooks, and improve tooling for day-to-day operations.
- Support safe change: Improve deployments, rollback mechanisms, and operational readiness.
- Contribute to operational practices: Write and maintain runbooks, and participate in blameless post-mortems.
- Collaborate closely with engineers: Work with product and feature teams to improve production readiness.
Requirements
- 3–6+ years in SRE, DevOps, Platform, or operations-heavy engineering roles.
- Experience supporting production systems and participating in on-call rotations.
- Comfortable debugging live systems under pressure.
- Experience operating cloud infrastructure (AWS preferred).
- Working knowledge of Kubernetes and containerised workloads.
- Infrastructure as Code experience (Terraform or similar).
- Familiarity with monitoring and alerting tools (Datadog, Prometheus, etc.).
- Scripting or automation experience (Python, Bash, or similar).
Benefits
- Healthcare, dental, and vision benefit options
- 401(k) with 3% match
- Personal development budget of $500 per annum
- Become an owner, with shares (equity) in the company
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Kubernetes, AWS, Infrastructure as Code, Terraform, Python, Bash, monitoring tools, alerting tools, Datadog, Prometheus
Soft Skills
incident response, communication, collaboration, debugging, problem-solving, process improvement, operational readiness, automation, blameless post-mortems, production readiness