Tech Stack
Azure, Cloud, Distributed Systems, Go, Grafana, Java, JavaScript, Kubernetes, Prometheus, Terraform, TypeScript
About the role
- Act as a technical authority, mentoring senior engineers and guiding design choices to improve service reliability and resilience
- Lead the definition and enforcement of SLIs, SLOs, and error budgets and drive adherence across engineering teams
- Collaborate with Staff peers and partner with development and product teams to design for failure and operationalize reliability from the start
- Drive company-wide adoption of observability best practices and tooling; ensure metrics, logs, and traces provide deep, actionable insights
- Lead complex incident responses, postmortems, and systemic reliability improvements while promoting a blameless culture
- Lead initiatives in infrastructure as code, deployment automation, and resilience testing; influence chaos engineering and release validation frameworks
- Partner with platform and security teams to ensure production readiness and represent the SRE team in technical leadership forums and product planning
Requirements
- 8+ years of experience in a Software Engineering or SRE role, including technical leadership
- Demonstrated experience mentoring and guiding senior engineers
- Deep expertise in building distributed systems on public cloud (Azure preferred)
- Strong programming skills (e.g., JavaScript, Go, TypeScript, Java, or C#)
- Hands-on experience with observability tooling (e.g., Prometheus, Grafana, OpenTelemetry)
- Mastery of infrastructure automation tools (Terraform, Pulumi) and container orchestration (Kubernetes)
- Ability to communicate clearly across geographies and disciplines
- Experience leading SRE initiatives across multiple product teams (preferred)
- Background in chaos engineering, incident learning, or performance and load testing (preferred)
- Familiarity with global compliance standards (ISO, SOC 2, GDPR, FedRAMP, CMMC) (preferred)