Salary
💰 $140,000 - $160,000 per year
Tech Stack
AWS, Cloud, Distributed Systems, Go, Google Cloud Platform, JavaScript, Kubernetes, Microservices, Prometheus, Python, Terraform, Vault
About the role
- At LifeStance Health, we’re building the future of mental healthcare—and we need a Senior Site Reliability Engineer to architect and safeguard the mission-critical infrastructure behind our national digital health platform.
- This is not just a support role. You’ll be a principal engineer shaping how our platform scales securely and reliably to serve millions.
- Define service-level objectives (SLOs), lead reliability reviews, champion incident response, and ensure production readiness is embedded in our engineering DNA.
- Architect scalable, secure infrastructure on AWS using EKS, Lambda, and edge networking strategies.
- Drive incident response operations, lead postmortems, and institutionalize root cause analysis (RCA) learnings.
- Automate everything, including provisioning, security controls, deployments, chaos experiments, and disaster recovery (DR) drills, using Terraform, Helm, and GitHub Actions.
- Build and maintain the observability stack (Datadog, Prometheus, ELK, OpenTelemetry); deliver actionable dashboards and alerts.
- Implement and maintain zero-trust IAM and secrets management frameworks (Vault, AWS Secrets Manager).
- Lead platform reliability reviews and collaborate with engineers, security, and compliance teams to harden architecture.
- Mentor engineers, lead production reviews, and evolve the reliability mindset company-wide.
Requirements
- 10+ years in DevOps/SRE/Platform Engineering roles, including 4+ years architecting distributed cloud-native systems at scale.
- Expert in AWS core services (EKS, VPC, RDS, Route 53, IAM, Lambda); Terraform-first mindset.
- Proven track record in establishing SLIs/SLOs, building error budgets, and aligning them with business velocity.
- Deep expertise in Kubernetes (EKS), Helm, service meshes (Istio/Linkerd), and microservices orchestration.
- Strong software engineering fundamentals in Python, Go, or similar.
- Hands-on experience with modern observability platforms and real-time monitoring solutions.
- Technical leadership in incident response, risk management, and operational resilience in regulated industries.
- Ability to translate system architecture into platform strategy and influence executive stakeholders.
- Preferred: AWS DevOps Pro, GCP SRE/Architect, or Certified Kubernetes Administrator (CKA) certification.
- Preferred: Experience with hybrid/multi-cloud systems and edge deployments.
- Preferred: Experience deploying and securing healthcare platforms (HIPAA, FHIR, HL7).
- Preferred: Published thought leadership or open-source contributions in reliability, observability, or infrastructure automation.
- Additional Requirements: Must be legally authorized to be employed in the United States.