Salary
💰 $175,000 - $215,000 per year
Tech Stack
Ansible, AWS, Azure, Cloud, Distributed Systems, Docker, Go, Google Cloud Platform, Grafana, Kubernetes, Linux, Microservices, Prometheus, Python, Terraform
About the role
- Build, lead, and mentor a team of SREs across multiple regions and time zones.
- Define the long-term vision and roadmap for SRE, aligning with organizational objectives.
- Partner with product and engineering to ensure reliability is embedded in design, development, and operations.
- Own the end-to-end reliability of critical customer-facing services.
- Establish and maintain SLOs, SLIs, and error budgets to measure and enforce service quality.
- Drive root cause analysis and problem management for major incidents, ensuring long-term fixes are prioritized.
- Champion adoption of ITIL/OSS processes (incident, change, problem, and capacity management).
- Expand automation in deployment, monitoring, testing, and incident response to reduce toil.
- Oversee observability platforms (e.g., Catchpoint, Grafana, Moogsoft/BigPanda, Prometheus, Datadog).
- Ensure robust configuration, capacity, and change management practices.
- Partner with Network Engineering, DevOps, NOC, and Product Engineering on scalable, resilient architecture.
- Support business continuity, disaster recovery, and compliance requirements.
- Engage with vendors and service providers to manage SLAs and performance outcomes.
- Hire, coach, and develop engineers and managers, creating strong career paths within SRE.
- Foster a culture of reliability, accountability, and continuous improvement.
- Lead succession planning and leadership pipeline development.
Requirements
- Bachelor’s degree in Computer Science, Engineering, or related field (Master’s preferred).
- 10+ years in infrastructure, reliability, or operations engineering roles.
- 5+ years in people leadership with experience managing managers and global teams.
- Deep expertise in Linux operating systems (administration, performance tuning, troubleshooting, security hardening).
- Strong knowledge of distributed systems, cloud platforms (AWS, GCP, Azure, private cloud), and networking fundamentals.
- Solid background in observability, monitoring, logging, and alerting frameworks.
- Proficiency with automation (Python, Go, Ansible, Terraform, CI/CD pipelines).
- Familiarity with containers (Kubernetes, Docker) and microservices architectures.
- Strong understanding of ITIL/OSS frameworks, SLO/error budget practices, and incident management at scale.
- Proven ability to manage large-scale, high-availability environments.
- Strong communication skills with executive presence; able to translate technical topics into business outcomes.
- Demonstrated success in building and maturing high-performing SRE/operations teams.
- Preferred: Experience in a service provider, CDN, or large-scale SaaS environment.
- Preferred: Familiarity with compliance and regulatory frameworks (SOC 2, ISO 27001, GDPR).
- Preferred: Track record of driving cultural transformation toward reliability-first principles.