Design, implement, and manage SLOs, SLIs, and error budgets, ensuring reliability aligns with user expectations and business objectives.
Develop production-grade software to enhance system reliability and reduce manual toil through automation.
Implement and optimize observability solutions using tools like OpenTelemetry, with a focus on high-cardinality metrics, distributed tracing, and actionable insights.
Drive postmortem processes and lead in-depth root cause analyses for incidents, ensuring lessons learned are effectively applied to prevent recurrence.
Define and monitor MTTx metrics (MTTA, MTTR, MTTF), using them to guide system improvements and measure reliability progress.
Design and participate in Chaos Engineering exercises.
Collaborate with engineering teams to design systems with reliability and scalability in mind, incorporating capacity planning, resiliency patterns, and modern deployment strategies (e.g., Canary, Blue-Green).
Lead design reviews for alerting strategies, ensuring effective signal-to-noise ratios in monitoring and incident management.
Advocate for and implement best practices in incident response and system design to achieve optimal uptime and performance.

Requirements

4+ years of experience as an SRE or in a similar role with hands-on coding.
3+ years of software development experience in Python or Golang, with a focus on building maintainable, production-quality code.
Deep understanding of SRE principles, particularly SLIs, SLOs, error budgets, and their real-world application.
Hands-on experience conducting postmortems and implementing observability at scale.
Hands-on experience conducting chaos engineering exercises.
Expertise in designing and implementing end-to-end observability solutions using tools like OpenTelemetry, Prometheus, Grafana, or Honeycomb.
Experience with distributed tracing and handling high-cardinality metrics in production environments.
3+ years of experience with AWS and proficiency in Kubernetes and Infrastructure as Code (IaC) tools such as Terraform.
Strong understanding of distributed systems, microservices architectures, and containerization (Docker, Kubernetes).
Hands-on experience with CI/CD platforms (Jenkins, ArgoCD), GitOps workflows, and building automated pipelines.
Familiarity with tools and frameworks for incident management and operational automation.
Knowledge of modern deployment strategies (e.g., Canary, Blue-Green) and resiliency patterns (e.g., circuit breakers, retries).

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Python, Golang, OpenTelemetry, Prometheus, Grafana, AWS, Kubernetes, Terraform, CI/CD, Infrastructure as Code

Soft Skills

collaboration, leadership, problem-solving, communication, analytical thinking