Tech Stack
Cloud, Distributed Systems, Open Source
About the role
- Guide the strategic reliability roadmap across services, collaborating with Staff SREs and product engineering leadership
- Lead deep dives into service architecture and operational behaviors to identify opportunities for system-wide improvements
- Set architectural guardrails and partner with engineering leadership to design for resilience and scalability at global scale
- Mentor Staff SREs and serve as the connective tissue between their areas of expertise
- Drive alignment across teams and functions through leadership, coaching, and technical authority
- Advocate for a blameless, data-informed culture of continuous improvement
- Lead and evolve chaos engineering, resilience testing, and system validation programs
- Establish visibility mechanisms (dashboards, SLO reporting, scorecards) that track our reliability posture
- Represent the SRE discipline in executive and cross-functional settings, influencing org-level decisions
- Collaborate with platform, security, and infrastructure teams to build shared tooling and processes
- Contribute thought leadership internally and externally through presentations, white papers, and conference talks
Requirements
- 12+ years in engineering roles, including 3+ years in a Staff+ or Principal-level position
- Expertise in large-scale distributed systems, cloud infrastructure, and resilience strategies
- Proven ability to set technical direction and align diverse teams around architectural goals
- Experience mentoring Staff+ engineers and guiding multi-team initiatives
- Advanced skills in systems architecture, cloud-native development, and automation tooling
- Excellent communication skills, with the ability to influence technical and non-technical stakeholders alike
- Preferred: Experience building a global SRE practice
- Preferred: Participation in industry forums, open source, or speaking engagements
- Preferred: Deep knowledge of service ownership, SLIs/SLOs, and organizational scaling challenges