Define and drive the long-term strategy for observability, operational intelligence, and reliability engineering across the organization, aligning technical direction with business growth, customer experience, and service-level objectives.
Lead the evolution toward intelligent operations by designing capabilities such as event correlation, anomaly detection, alert noise reduction, predictive signal detection, and automated remediation to improve MTTD, MTTR, and operational efficiency.
Architect and lead the end-to-end observability platform across metrics, logs, traces, and events. Establish scalable telemetry standards, instrumentation patterns, and onboarding models that enable consistent visibility across AWS and cloud-native services.
Drive large-scale automation initiatives that reduce operational toil, including self-service infrastructure workflows, policy-as-code guardrails, reliability automation, and automated response for common failure scenarios.
Partner with product, platform, and data teams to embed reliability, performance, cost efficiency, and fault tolerance into system design. Lead capacity modeling, resilience planning, and architecture improvements for multi-AZ and multi-region environments.
Provide technical leadership during high-severity incidents and guide blameless postmortems that identify systemic issues and drive long-term reliability improvements.
Define and standardize SLO/SLI frameworks, error budget practices, telemetry conventions, and infrastructure patterns to ensure consistent operational excellence across teams.
Evaluate and introduce emerging AWS-native, cloud-native, and AI-enabled observability and automation technologies. Lead proofs-of-concept and guide organization-wide adoption.
Mentor Staff and Senior SREs, raising the bar for system design, operational rigor, and engineering judgment while fostering a culture of ownership, learning, and continuous improvement.
Act as a senior technical authority for reliability and observability, shaping engineering roadmaps and influencing architectural decisions across product and platform domains.

Requirements

8–10+ years of experience in SRE, platform engineering, or cloud infrastructure roles supporting large-scale production environments.
Demonstrated experience leading architecture, reliability strategy, or operational platforms across multiple teams or organizational domains.
Proven track record operating in 24/7 production environments, including incident leadership, postmortem practices, and proactive reliability management.
Deep expertise designing and operating large-scale AWS environments, including services such as VPC, EC2, EKS/ECS, RDS/DynamoDB, S3, ALB/NLB, IAM, KMS, Route 53, and multi-account architectures.
Experience designing resilient, fault-tolerant systems using multi-AZ/multi-region patterns, graceful degradation, rate limiting, and capacity management.
Senior-level experience with observability platforms (metrics, logs, traces, events) such as New Relic, Datadog, Prometheus/Grafana, OpenTelemetry, or similar.
Experience defining telemetry standards, instrumentation strategies, centralized dashboards, and low-noise alerting practices.
Experience improving operational signal quality through correlation, noise reduction, or advanced analytics.
(Preferred) Experience implementing or evaluating AIOps capabilities such as anomaly detection, event correlation, predictive alerting, automated remediation, or AI-assisted incident analysis.
Familiarity with applying machine learning or AI techniques to operational data, incident trends, or reliability workflows.
Expert-level experience with Infrastructure-as-Code using Terraform and/or CloudFormation, including reusable modules, GitOps workflows, and policy-as-code guardrails.
Strong scripting or programming skills (Python, Go, Bash, or similar) for automation and operational tooling.
Expert understanding of Linux systems, networking (TCP/IP, DNS, TLS), and distributed system behavior.
Expert-level experience with Kubernetes and cloud-native architecture patterns.
Demonstrated ability to influence technical direction without direct authority.
Experience mentoring senior engineers and setting organization-wide engineering standards.
Ability to operate effectively in complex, high-impact environments and drive initiatives from concept through adoption.

Benefits

Medical, dental and vision insurance
Health Savings Account
Flexible Spending Accounts
Telehealth
401(k) and 401(k) match
Life and AD&D insurance
Short-Term and Long-Term Disability
Flexible Time Off (FTO) or Paid Time Off (PTO)
Employee Well-Being program
11 paid holidays plus 1 inclusive holiday per year
Volunteer Time Off
Employee Referral program
Education Reimbursement Program
Employee Recognition and Appreciation program
