
Site Reliability Engineer, Azure – DevSecOps – IaC – Governance – Observability
Avaya
full-time
Location Type: Remote
Location: United States
Salary
💰 $129,000 - $143,000 per year
About the role
- Serve as a key member of the 24×7 on-call rotation, responding to and managing incidents across production and pre-production environments.
- Lead incident bridges, coordinate root cause analysis (RCA), and ensure post-incident reviews drive systemic improvements.
- Maintain clear communication with cross-functional teams and leadership during major incidents.
- Build, tune, and maintain observability dashboards (Azure Monitor, GCP Operations Suite, Prometheus, Grafana, Datadog, Log Analytics).
- Perform deep-dive troubleshooting of application and service-level issues using distributed tracing and log analysis (Grafana, Datadog) to pinpoint root causes beyond infrastructure.
- Define SLOs, SLIs, and error budgets to proactively identify and mitigate reliability risks before customer impact.
- Integrate AI-Ops tools for anomaly detection, predictive alerting, and automated incident correlation.
- Continuously enhance alert quality, reduce false positives, and automate runbooks for faster recovery.
Requirements
- 5+ years in Site Reliability, DevOps, Cloud Operations, or customer support roles.
- Demonstrated experience in application-level troubleshooting by analyzing logs and traces to identify bugs, performance bottlenecks, and error conditions.
- Expertise in Azure and GCP cloud operations and distributed system reliability.
- Understanding of Terraform, Ansible, and CI/CD pipelines (Jenkins, GitHub Actions).
- Experience with observability and AI-Ops tools (Azure Monitor, GCP Operations Suite, Grafana, Prometheus, Datadog, etc.).
- Solid grasp of incident management frameworks (P1–P3 handling, RCA, PIRs, on-call rotations).
- Excellent analytical, troubleshooting, and communication skills.
Benefits
- Performance-related bonus
- Benefits
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
incident management, application-level troubleshooting, distributed tracing, log analysis, SLOs, SLIs, error budgets, Terraform, Ansible, CI/CD pipelines
Soft Skills
analytical skills, troubleshooting skills, communication skills