Participate in on-call and incident response:
Respond to production incidents, contribute to service restoration, and support clear communication during incidents.
Improve operational reliability:
Identify recurring issues and reliability risks, and drive fixes through better alerting, automation, system changes, or process improvements.
Own parts of the production environment:
Operate and improve Kubernetes clusters, cloud infrastructure, and core platform services.
Strengthen observability:
Improve dashboards, alerts, logs, and traces so issues are detected earlier.
Reduce operational toil:
Automate repetitive tasks, simplify runbooks, and improve tooling for day-to-day operations.
Support safe change:
Improve deployments, rollback mechanisms, and operational readiness.
Contribute to operational practices:
Write and maintain runbooks, and participate in blameless post-mortems.
Collaborate closely with engineers:
Work with product and feature teams to improve production readiness.

Requirements

3–6+ years of experience in SRE, DevOps, Platform, or operations-heavy engineering roles.
Experience supporting production systems and participating in on-call rotations.
Comfortable debugging live systems under pressure.
Experience operating cloud infrastructure (AWS preferred).
Working knowledge of Kubernetes and containerised workloads.
Infrastructure as Code experience (Terraform or similar).
Familiarity with monitoring and alerting tools (Datadog, Prometheus, etc.).
Scripting or automation experience (Python, Bash, or similar).

Benefits

Healthcare, dental, and vision benefit options
401(k) with 3% match
Personal development budget of $500 per annum
Become an owner, with shares (equity) in the company

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Kubernetes, AWS, Infrastructure as Code, Terraform, Python, Bash, monitoring tools, alerting tools, Datadog, Prometheus

Soft Skills

incident response, communication, collaboration, debugging, problem-solving, process improvement, operational readiness, automation, blameless post-mortems, production readiness