Zeta Global

Senior Site Reliability Engineer

full-time

Location Type: Remote

Location: United States

Salary

💰 $140,000 - $170,000 per year

About the role

  • Design, implement, and manage SLOs, SLIs, and error budgets, ensuring reliability aligns with user expectations and business objectives.
  • Develop production-grade software to enhance system reliability and reduce manual toil through automation.
  • Implement and optimize observability solutions using tools like OpenTelemetry, with a focus on high-cardinality metrics, distributed tracing, and actionable insights.
  • Drive postmortem processes and lead in-depth root cause analyses for incidents, ensuring lessons learned are effectively applied to prevent recurrence.
  • Define and monitor MTTx metrics (MTTA, MTTR, MTTF), using them to guide system improvements and measure reliability progress.
  • Design and participate in Chaos Engineering exercises.
  • Collaborate with engineering teams to design systems with reliability and scalability in mind, incorporating capacity planning, resiliency patterns, and modern deployment strategies (e.g., Canary, Blue-Green).
  • Lead design reviews for alerting strategies, ensuring effective signal-to-noise ratios in monitoring and incident management.
  • Advocate for and implement best practices in incident response and system design to achieve optimal uptime and performance.

Requirements

  • 4+ years of experience as an SRE or in a similar role with hands-on coding.
  • 3+ years of software development experience in Python or Golang, with a focus on building maintainable, production-quality code.
  • Deep understanding of SRE principles, particularly SLIs, SLOs, error budgets, and their real-world application.
  • Hands-on experience conducting postmortems and implementing observability at scale.
  • Hands-on experience conducting chaos engineering exercises.
  • Expertise in designing and implementing end-to-end observability solutions using tools like OpenTelemetry, Prometheus, Grafana, or Honeycomb.
  • Experience with distributed tracing and handling high-cardinality metrics in production environments.
  • 3+ years of experience with AWS and proficiency in Kubernetes and Infrastructure as Code (IaC) tools such as Terraform.
  • Strong understanding of distributed systems, microservices architectures, and containerization (Docker, Kubernetes).
  • Hands-on experience with CI/CD tools and practices (GitOps, Jenkins, ArgoCD) and building automated pipelines.
  • Familiarity with tools and frameworks for incident management and operational automation.
  • Knowledge of modern deployment strategies (e.g., Canary, Blue-Green) and resiliency patterns (e.g., circuit breakers, retries).

Benefits

  • Unlimited PTO
  • Excellent medical, dental, and vision coverage
  • Employee Equity and Stock Purchase Plan
  • Employee discounts, virtual wellness classes, pet insurance, and more

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Python, Golang, OpenTelemetry, Prometheus, Grafana, AWS, Kubernetes, Terraform, CI/CD, Infrastructure as Code

Soft Skills
collaboration, leadership, problem-solving, communication, analytical thinking