Lyrebird Health

Staff Site Reliability Engineer

Full-time

Location Type: Remote

Location: United Kingdom

About the role

  • Own reliability outcomes across core services and customer-facing systems
  • Define, implement, and evolve SLOs/SLIs, alerting strategy, and error budgets
  • Lead initiatives to improve uptime, latency, and overall system resilience
  • Proactively identify reliability risks and drive mitigation plans to completion
  • Improve end-to-end observability (metrics, logs, traces) so issues are detected early and diagnosed quickly
  • Lead incident response for high-severity events and guide teams through calm, effective mitigation
  • Drive post-incident reviews that result in measurable, lasting improvements
  • Build a culture of operational excellence: fewer incidents, faster recovery, better learning
  • Develop internal tooling and paved paths that make “doing the right thing” the easiest option
  • Improve the developer experience around deployments, rollbacks, environment consistency, and service ownership
  • Partner with engineers to uplift production-readiness across new and existing services
  • Improve infrastructure reliability and maintainability using Infrastructure as Code
  • Strengthen deployment workflows and reduce operational toil through automation
  • Help shape architecture decisions with a reliability and scalability lens
  • Embed security and compliance principles into platform practices (access controls, auditability, safe-by-default designs)
  • Work closely with Security and Engineering leadership to support regulatory and enterprise requirements without slowing down delivery

Requirements

  • 8+ years of engineering experience, with significant depth in SRE, platform, and production systems
  • Strong experience operating and improving systems in production (including incident response)
  • Proven ability to lead cross-team initiatives and influence engineering standards

Technical Strength

You don’t need to tick every box, but you should be strong across most:

Cloud/Infrastructure
  • AWS (ECS, EC2, VPC, IAM, RDS/Aurora, S3, CloudWatch)
  • Infrastructure as Code (Terraform)

Observability
  • Strong grasp of monitoring and alerting principles
  • Experience with logs, metrics, and tracing, and with building meaningful dashboards
  • Familiarity with OpenTelemetry and modern observability tooling

Systems & Operational Excellence
  • Knowledge of reliability patterns: graceful degradation, retries, backoff, timeouts, load shedding, capacity planning
  • Strong debugging instincts across distributed systems
  • Practical approach to risk management and tradeoffs

Software Engineering
  • Ability to build tools and automation (TypeScript, Go, Python, or similar)
  • Familiarity with CI/CD and safe rollout strategies (feature flags, canary, blue/green)
Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
SRE, Infrastructure as Code, AWS, Terraform, Observability, Monitoring, Alerting, TypeScript, Go, Python
Soft Skills
leadership, cross-team initiatives, influence engineering standards, risk management, debugging instincts, operational excellence, incident response, communication, collaboration, problem-solving