Lead cross-functional reliability initiatives across multiple value streams and coordinate execution across teams.
Define and evolve SRE best practices, tools, and methodologies across the organization.
Architect enterprise-scale, multi-region AWS infrastructure that balances reliability, cost, performance, and security.
Establish and operate SLOs, SLIs, and error budgets for critical services, using them to drive prioritization decisions.
Serve as incident commander for major incidents and drive postmortems that produce completed action items and organizational learning.
Lead disaster recovery planning for critical financial services infrastructure.
Build shared Infrastructure as Code foundations in Terraform (reusable modules, standards, and patterns adopted across teams).
Design and implement production-scale Kubernetes patterns, including multi-tenancy, security policies, and advanced scheduling.
Establish observability standards and strategies using Datadog and Splunk (metrics, logging, tracing, dashboards, and alerting).
Set CI/CD standards and patterns, including pipeline-as-code and progressive delivery at scale.
Lead chaos engineering, game days, and systematic reliability testing initiatives.
Drive FinOps initiatives to optimize cloud spend while maintaining reliability targets.
Lead a functional team of SREs (without direct reports) on projects and operational initiatives.
Mentor SREs at multiple levels through coaching, design reviews, code reviews, and training sessions.
Partner with Engineering, Product, and Security leadership to align reliability work with business priorities, zero-trust architecture, and compliance controls.

Requirements

Bachelor’s degree in Computer Science, Information Technology, or a related field (or equivalent practical experience).
7 to 10 years of Site Reliability Engineering experience (or equivalent), with demonstrated technical leadership.
Proven ability to lead technical teams and drive complex projects to completion.
Expert AWS knowledge, including designing large-scale, multi-region architectures.
Deep Kubernetes expertise, including advanced features, security, and production-scale operations.
Mastery of Infrastructure as Code using Terraform, including building shared platforms and frameworks.
Strong software engineering background with production experience in Python and/or Go.
Extensive experience with observability platforms (Datadog, Splunk) and implementing monitoring at scale.
Deep understanding of CI/CD principles and experience implementing enterprise-grade pipelines.
Proven track record leading major incidents and conducting effective postmortems.
Strong understanding of security, networking, and infrastructure design patterns.
Strong communication skills, with the ability to explain complex technical concepts to diverse audiences.
Experience mentoring engineers and building technical capabilities in teams.

Benefits

Medical, dental, vision and life insurance
Retirement savings – 401(k) plan with generous company matching contributions (up to 6%), financial advisory services, potential company discretionary contribution, and a broad investment lineup
Tuition reimbursement up to $5,250/year
Business-casual environment that includes the option to wear jeans
Generous paid time off upon hire – including a paid time off program plus ten paid company holidays and three floating holidays each calendar year
Paid volunteer time – 16 hours per calendar year
Leave of absence programs – including paid parental leave, paid short- and long-term disability, and Family and Medical Leave (FMLA)
Business Resource Groups (BRGs) – BRGs facilitate inclusion and collaboration across our business internally and throughout the communities where we live, work and play. BRGs are open to all.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Site Reliability Engineering, AWS, Kubernetes, Terraform, Python, Go, CI/CD, observability, FinOps, chaos engineering

Soft Skills

technical leadership, communication, mentoring, project management, collaboration