Senior Site Reliability Engineer

Heidi Health

Senior Site Reliability Engineer supporting production systems for Heidi's AI Care Partner. Focused on incident response, system reliability, and day-to-day operations in a hybrid environment.

Posted 6/5/2026, full-time, 🇮🇪 Ireland, Senior

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills

Kubernetes, AWS, Terraform, Python, Bash, monitoring tools, alerting tools, automation, debugging, production systems

Soft Skills

communication, collaboration, problem-solving, incident response, leadership, process improvement, operational readiness, blameless post-mortems, reliability expectations, service ownership

Tools & Technologies

Datadog, Prometheus, cloud infrastructure, runbooks, dashboards, alerts, logs, traces, automation tools, production environment

Industry Keywords

SRE, DevOps, operations-heavy engineering, incident response processes, operational reliability, recurring issues, reliability risks, service restoration, production incidents, containerised workloads

Tech Stack

Tools & technologies

AWS, Cloud, Kubernetes, Prometheus, Python, Terraform

About the role

Key responsibilities & impact

Participate in on-call and incident response: Respond to production incidents, contribute to service restoration, and support clear communication during incidents. Over time, take increasing responsibility for leading incidents end-to-end.
Improve operational reliability: Identify recurring issues and reliability risks, and drive fixes through better alerting, automation, system changes, or process improvements.
Own parts of the production environment: Operate and improve Kubernetes clusters, cloud infrastructure, and core platform services, with growing ownership as familiarity increases.
Strengthen observability: Improve dashboards, alerts, logs, and traces so issues are detected earlier and diagnosed faster, with a strong focus on actionable signals.
Reduce operational toil: Automate repetitive tasks, simplify runbooks, and improve tooling to make on-call and day-to-day operations easier and safer.
Support safe change: Improve deployments, rollback mechanisms, and operational readiness to reduce the risk of incidents caused by change.
Contribute to operational practices: Write and maintain runbooks, participate in blameless post-mortems, and help improve incident response processes over time.
Collaborate closely with engineers: Work with product and feature teams to improve production readiness, service ownership, and reliability expectations.

Requirements

What you’ll need

3–6+ years in SRE, DevOps, Platform, or operations-heavy engineering roles.
Experience supporting production systems and participating in on-call rotations.
Comfortable debugging live systems under pressure.
Experience operating cloud infrastructure (AWS preferred).
Working knowledge of Kubernetes and containerised workloads.
Infrastructure as Code experience (Terraform or similar).
Familiarity with monitoring and alerting tools (Datadog, Prometheus, etc.).
Scripting or automation experience (Python, Bash, or similar).

Benefits

Comp & perks

Your health, covered. Comprehensive private medical and dental cover through Bupa, plus 24/7 mental health, coaching and wellbeing support through Sonder and a £100/month Healthy Heidi’s stipend.
Global parental leave. 26 weeks paid for primary carers and 18 weeks for secondary carers, subject to eligibility.
Fertility support. A £7,000 one-off payment, subject to eligibility.
Learning & development. £700 per year for courses, books, memberships, conferences and more.
Home office budget. £500 one-off to set up a workspace you actually want to work in.
Recharge days after major milestones and busy periods so you can reset and come back strong.
Work from anywhere for up to 4 weeks per year, wherever the world takes you.
Clinical leave. 10 days per year for eligible clinical roles to maintain accreditation and requirements.
Flexibility that works. A hybrid environment, with 3 days in the office.