Staff Site Reliability Engineer

Lyrebird Health

full-time

Posted on: 1/21/2026

Location Type: Remote

Location: United Kingdom

Job Level

Lead

Tech Stack

AWS, Cloud, Distributed Systems, EC2, Go, Python, Terraform, TypeScript

About the role

Own reliability outcomes across core services and customer-facing systems
Define, implement, and evolve SLOs/SLIs, alerting strategy, and error budgets
Lead initiatives to improve uptime, latency, and overall system resilience
Proactively identify reliability risks and drive mitigation plans to completion
Improve end-to-end observability (metrics, logs, traces) so issues are detected early and diagnosed quickly
Lead incident response for high-severity events and guide teams through calm, effective mitigation
Drive post-incident reviews that result in measurable, lasting improvements
Build a culture of operational excellence: fewer incidents, faster recovery, better learning
Develop internal tooling and paved paths that make “doing the right thing” the easiest option
Improve the developer experience around deployments, rollbacks, environment consistency, and service ownership
Partner with engineers to uplift production-readiness across new and existing services
Improve infrastructure reliability and maintainability using Infrastructure as Code
Strengthen deployment workflows and reduce operational toil through automation
Help shape architecture decisions with a reliability and scalability lens
Embed security and compliance principles into platform practices (access controls, auditability, safe-by-default designs)
Work closely with Security and Engineering leadership to support regulatory and enterprise requirements without slowing down delivery

Requirements

8+ years of engineering experience, with significant depth in SRE, platform, and production systems
Strong experience operating and improving systems in production (including incident response)
Proven ability to lead cross-team initiatives and influence engineering standards
Technical Strength

You don’t need to tick every box, but you should be strong across most:

Cloud/Infrastructure
AWS (ECS, EC2, VPC, IAM, RDS/Aurora, S3, CloudWatch)
Infrastructure as Code (Terraform)

Observability
Strong grasp of monitoring and alerting principles
Experience with logs, metrics, and tracing, and with building meaningful dashboards
Familiarity with OpenTelemetry and modern observability tooling

Systems & Operational Excellence
Knowledge of reliability patterns: graceful degradation, retries, backoff, timeouts, load shedding, capacity planning
Strong debugging instincts across distributed systems
Practical approach to risk management and tradeoffs

Software Engineering
Ability to build tools and automation (TypeScript, Go, Python, or similar)
Familiarity with CI/CD and safe rollout strategies (feature flags, canary, blue/green)
