Boeing

Cloud Reliability Manager

  • Posted 5/21/2026
  • full-time
  • Seattle • California, Illinois, Montana, Washington • 🇺🇸 United States
  • Mid-Level, Senior
  • 💰 $161,500 - $233,450 per year

Tech Stack

Tools & technologies
Cloud, Elasticsearch, Grafana, Kubernetes, Logstash, Prometheus, Terraform

About the role

Key responsibilities & impact
  • Own strategy, roadmap, and delivery for Runtime SRE and Cloud Operations to meet enterprise Service Level Objectives (SLOs) and operational Service-Level Agreements (SLAs)
  • Lead, mentor, and grow teams responsible for runtime SRE (SLOs/SLIs, observability, performance engineering, Disaster Recovery (DR), chaos testing) and Cloud Operations
  • Establish and own incident management processes: detection, escalation, incident command, post-incident reviews, and remediation planning; ensure rapid detection and reduced Mean Time to Repair (MTTR)
  • Drive observability and telemetry strategy (metrics, tracing, logs) to ensure actionable alerts and proactive detection of platform issues
  • Lead capacity planning, performance tuning, and disaster recovery orchestration for platform services and multi-cluster fleets
  • Convert Root Cause Analysis (RCA) outcomes into prioritized engineering work
  • Define and measure operational Key Performance Indicators (KPIs) and implement automation to reduce manual toil
  • Own on-call and rotation policies, runbook quality, bridge setup SLAs, and operational playbooks; ensure teams are trained and drills are executed regularly
  • Ensure security, compliance, and change management controls are integrated into operational procedures and emergency responses

Requirements

What you’ll need
  • 5+ years in cloud operations, SRE, and/or related roles
  • 3+ years managing technical teams with on-call responsibilities
  • 3+ years of experience with Kubernetes at scale and multi-cloud runtime platforms (EKS/AKS/GKE)
  • 3+ years of experience with observability tooling (Prometheus, Grafana, OpenTelemetry, the ELK stack (Elasticsearch, Logstash, Kibana), the EFK stack (Elasticsearch, Fluentd, Kibana), tracing) and alerting design
  • Experience owning incident response and improving reliability metrics in production environments
  • Experience with capacity planning, performance engineering, and disaster recovery at cloud scale
  • Experience with automation tooling (Terraform, CI/CD, operators) and integrating reliability into IaC pipelines

Benefits

Comp & perks
  • health insurance
  • flexible spending accounts
  • health savings accounts
  • retirement savings plans
  • life and disability insurance programs
  • paid and unpaid time away from work

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
cloud operations, site reliability engineering (SRE), incident management, capacity planning, performance tuning, disaster recovery (DR), root cause analysis (RCA), automation, observability, alerting design
Soft Skills
leadership, mentoring, team management, communication, incident command, problem-solving, strategic planning, training, collaboration, organizational skills