
Staff Site Reliability Engineer
Lyrebird Health
full-time
Location Type: Remote
Location: United Kingdom
About the role
- Own reliability outcomes across core services and customer-facing systems
- Define, implement, and evolve SLOs/SLIs, alerting strategy, and error budgets
- Lead initiatives to improve uptime, latency, and overall system resilience
- Proactively identify reliability risks and drive mitigation plans to completion
- Improve end-to-end observability (metrics, logs, traces) so issues are detected early and diagnosed quickly
- Lead incident response for high-severity events and guide teams through calm, effective mitigation
- Drive post-incident reviews that result in measurable, lasting improvements
- Build a culture of operational excellence: fewer incidents, faster recovery, better learning
- Develop internal tooling and paved paths that make “doing the right thing” the easiest option
- Improve the developer experience around deployments, rollbacks, environment consistency, and service ownership
- Partner with engineers to uplift production-readiness across new and existing services
- Improve infrastructure reliability and maintainability using Infrastructure as Code
- Strengthen deployment workflows and reduce operational toil through automation
- Help shape architecture decisions with a reliability and scalability lens
- Embed security and compliance principles into platform practices (access controls, auditability, safe-by-default designs)
- Work closely with Security and Engineering leadership to support regulatory and enterprise requirements without slowing down delivery
Requirements
- 8+ years of engineering experience, with significant depth in SRE/platform/production systems
- Strong experience operating and improving systems in production (including incident response)
- Proven ability to lead cross-team initiatives and influence engineering standards
Technical Strength
You don’t need to tick every box, but you should be strong across most:
**Cloud/Infrastructure**
- AWS (ECS, EC2, VPC, IAM, RDS/Aurora, S3, CloudWatch)
- Infrastructure as Code (Terraform)
**Observability**
- Strong grasp of monitoring and alerting principles
- Experience with logs + metrics + tracing and building meaningful dashboards
- Familiar with OpenTelemetry and modern observability tooling
**Systems & Operational Excellence**
- Knowledge of reliability patterns: graceful degradation, retries, backoff, timeouts, load shedding, capacity planning
- Strong debugging instincts across distributed systems
- Practical approach to risk management and tradeoffs
**Software Engineering**
- Ability to build tools and automation (TypeScript, Go, Python, or similar)
- Familiarity with CI/CD and safe rollout strategies (feature flags, canary, blue/green)
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
SRE, Infrastructure as Code, AWS, Terraform, Observability, Monitoring, Alerting, TypeScript, Go, Python
Soft Skills
leadership, cross-team initiatives, influence engineering standards, risk management, debugging instincts, operational excellence, incident response, communication, collaboration, problem-solving