Salary
💰 $140,000 - $180,000 per year
Tech Stack
Ansible, AWS, Chef, Cloud, Go, Grafana, Java, Kubernetes, Linux, Prometheus, Python, Splunk, Terraform
About the role
- Responsible for availability, latency, performance, efficiency, monitoring/observability, emergency response, and capacity planning
- Set and maintain SLOs, SLIs, and error budgets, and create dashboards to track them
- Analyze, troubleshoot, and resolve operational issues that affect defined SLOs
- Manage site stability, performance, reliability, and maintain uptime for production environments
- Develop a fully automated multi-environment observability stack and extend it to predict capacity needs
- Automate to reduce toil and increase development velocity
- Provide application-specific production support, incident management, change management, problem management, RCAs, and service restoration
- Identify architecture changes for reliability, performance, and availability using a data-driven approach
- Document run books and standard operating procedures
- Collaborate with software development teams on release management and operational readiness
- Implement reliability and observability tools (New Relic, Prometheus, Grafana, etc.)
Requirements
- Strong background as an SRE supporting a 24x7 highly available production environment for a SaaS or cloud service provider
- Strong experience with AWS and Infrastructure as Code (Terraform, CloudFormation)
- Understanding of High Availability best practices in AWS
- Solid experience with Monitoring/APM/Observability tools (Splunk, New Relic)
- Experience with Prometheus and Grafana; implementing observability plans around logs, metrics, and traces
- Extensive experience with Kubernetes, Helm, CI/CD, and configuration management tools such as Ansible and Chef
- Experience with release automation, system administration, and configuration management
- Programming experience in Java, Python, Go (or similar)
- Scripting experience with Bash and PowerShell
- Strong understanding of Linux, Windows, software development, systems, networking, and cloud concepts
- Experience with SLOs, SLIs, Error Budgets, dashboards, incident management, RCA, and change management