Cloud Reliability Manager

Boeing

Cloud Reliability Manager leading Runtime SRE and Cloud Operations at Boeing. Ensuring reliability, scalability, and operational excellence across multi-cloud environments.

Posted 5/21/2026 • full-time • Seattle • California, Illinois, Montana, Washington • 🇺🇸 United States • Mid-Level, Senior • 💰 $161,500 - $233,450 per year

Tech Stack

Tools & technologies

Cloud, Elasticsearch, Grafana, Kubernetes, Logstash, Prometheus, Terraform

About the role

Key responsibilities & impact

Own strategy, roadmap, and delivery for Runtime SRE and Cloud Operations to meet enterprise Service Level Objectives (SLOs) and operational Service Level Agreements (SLAs)
Lead, mentor, and grow teams responsible for runtime SRE (SLOs/SLIs, observability, performance engineering, Disaster Recovery (DR), chaos testing) and Cloud Operations
Establish and own incident management processes: detection, escalation, incident command, post-incident reviews, and remediation planning; ensure rapid detection and reduced Mean Time to Repair (MTTR)
Drive observability and telemetry strategy (metrics, tracing, logs) to ensure actionable alerts and proactive detection of platform issues
Lead capacity planning, performance tuning, and disaster recovery orchestration for platform services and multi-cluster fleets
Convert Root Cause Analysis (RCA) outcomes into prioritized engineering work
Define and measure operational Key Performance Indicators (KPIs) and implement automation to reduce manual toil
Own on-call and rotation policies, runbook quality, bridge setup SLAs, and operational playbooks; ensure teams are trained and drills are executed regularly
Ensure security, compliance, and change management controls are integrated into operational procedures and emergency responses

Requirements

What you’ll need

5+ years in cloud operations, SRE, and/or related roles
3+ years managing technical teams with on-call responsibilities
3+ years of experience with Kubernetes at scale and multi-cloud runtime platforms (EKS/AKS/GKE)
3+ years of experience with observability tooling (Prometheus, Grafana, OpenTelemetry, ELK (Elasticsearch, Logstash, Kibana), EFK (Elasticsearch, Fluentd, Kibana), tracing) and alerting design
Experience owning incident response and improving reliability metrics in production environments
Experience with capacity planning, performance engineering, and disaster recovery at cloud scale
Experience with automation tooling (Terraform, CI/CD, operators) and integrating reliability into IaC pipelines

Benefits

Comp & perks

health insurance
flexible spending accounts
health savings accounts
retirement savings plans
life and disability insurance programs
paid and unpaid time away from work

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

cloud operations, site reliability engineering (SRE), incident management, capacity planning, performance tuning, disaster recovery (DR), root cause analysis (RCA), automation, observability, alerting design

Soft Skills

leadership, mentoring, team management, communication, incident command, problem-solving, strategic planning, training, collaboration, organizational skills