
Senior Site Reliability Engineer
Zeta Global
full-time
Location Type: Remote
Location: United States
Salary
💰 $140,000 - $170,000 per year
About the role
- Design, implement, and manage SLOs, SLIs, and error budgets, ensuring reliability aligns with user expectations and business objectives.
- Develop production-grade software to enhance system reliability and reduce manual toil through automation.
- Implement and optimize observability solutions using tools like OpenTelemetry, with a focus on high-cardinality metrics, distributed tracing, and actionable insights.
- Drive postmortem processes and lead in-depth root cause analyses for incidents, ensuring lessons learned are effectively applied to prevent recurrence.
- Define and monitor MTTx metrics (MTTA, MTTR, MTTF), using them to guide system improvements and measure reliability progress.
- Design and participate in Chaos Engineering exercises.
- Collaborate with engineering teams to design systems with reliability and scalability in mind, incorporating capacity planning, resiliency patterns, and modern deployment strategies (e.g., Canary, Blue-Green).
- Lead design reviews for alerting strategies, ensuring effective signal-to-noise ratios in monitoring and incident management.
- Advocate for and implement best practices in incident response and system design to achieve optimal uptime and performance.
Requirements
- 4+ years of experience as an SRE or in a similar role with hands-on coding.
- 3+ years of software development experience in Python or Golang, with a focus on building maintainable, production-quality code.
- Deep understanding of SRE principles, particularly SLIs, SLOs, error budgets, and their real-world application.
- Hands-on experience conducting postmortems and implementing observability at scale.
- Hands-on experience conducting chaos engineering exercises.
- Expertise in designing and implementing end-to-end observability solutions using tools like OpenTelemetry, Prometheus, Grafana, or Honeycomb.
- Experience with distributed tracing and handling high-cardinality metrics in production environments.
- 3+ years of experience with AWS and proficiency in Kubernetes, Terraform, and other Infrastructure as Code (IaC) tools.
- Strong understanding of distributed systems, microservices architectures, and containerization (Docker, Kubernetes).
- Hands-on experience with CI/CD tooling and practices (GitOps, Jenkins, Argo CD) and building automated pipelines.
- Familiarity with tools and frameworks for incident management and operational automation.
- Knowledge of modern deployment strategies (e.g., Canary, Blue-Green) and resiliency patterns (e.g., circuit breakers, retries).
Benefits
- Unlimited PTO
- Excellent medical, dental, and vision coverage
- Employee Equity and Stock Purchase Plan
- Employee discounts, virtual wellness classes, pet insurance, and more
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Python, Golang, OpenTelemetry, Prometheus, Grafana, AWS, Kubernetes, Terraform, CI/CD, Infrastructure as Code
Soft Skills
collaboration, leadership, problem-solving, communication, analytical thinking