Leading and executing on key reliability initiatives from planning to delivery, particularly focusing on monitoring, alerting, and incident response for the Firefighters team.
Monitoring and Alerting Setup: Configuring comprehensive alerting systems, including queue monitoring and service health checks.
Metrics and Dashboards: Building performance dashboards, implementing load testing, and creating capacity metrics for presentations.
Observability Enhancement: Implementing end-to-end traceability with distributed tracing and service profiling.
Infrastructure Automation: Working on pipeline improvements, moving to strength-based pipelines.
Datadog Integration: Continuing the migration back to Datadog and optimizing our monitoring stack.
Collaborating with cross-functional teams to deliver scalable solutions, optimize processes, and implement highly reliable systems.
Taking ownership of complex SRE tasks, including configuring monitoring systems and defining and enforcing SLIs, SLOs, and SLAs.
Proposing improvements and helping establish best practices, workflows, and standards for incident response, blameless post-mortems, and continuous improvement.

Requirements

7-10+ years of experience in Site Reliability Engineering, DevOps, or a similar role focused on large-scale distributed systems.
Strong skills in Kubernetes for container orchestration and cluster management.
Extensive experience with AWS as a core cloud platform for infrastructure management.
Proficiency with Datadog for monitoring, logging, tracing, and alerting is critical.
Proven experience in designing, implementing, and optimizing CI/CD pipelines, ideally with GitHub Pipelines.
Strong understanding and practical application of SRE principles: SLI/SLO/SLA definition, error budget management, incident response, post-mortem analysis, and toil reduction.
A proactive mindset with the ability to solve complex problems, drive projects independently, and continuously innovate our reliability practices.
Strong communication skills, especially in English, to collaborate effectively across technical teams and stakeholders.

Benefits

Competitive salary tailored to your experience, skills, and expertise.
Equity opportunities so you can share in our growth and success.
Unlimited PTO and flexibility when you need it the most.
Referral bonus. We truly believe we hire fantastic people, and great talent recognizes great talent. We offer a significant bonus for each referral we hire.
Yearly learning & development stipend to help you grow and do your best work.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Site Reliability Engineering, DevOps, Kubernetes, AWS, Datadog, CI/CD pipelines, GitHub Pipelines, monitoring systems, load testing, distributed tracing

Soft Skills

problem solving, project management, communication, collaboration, proactive mindset, continuous improvement, ownership, innovation, cross-functional teamwork, blameless post-mortems