OpenAI

Data Center Incident Program Manager

full-time

Location Type: Remote

Location: United States

Salary

💰 $125,600 - $228,000 per year

About the role

  • Define and maintain incident severity levels (SEV definitions), classification criteria, and escalation thresholds.
  • Establish end-to-end incident response standards: protocols, lifecycle stages (declare → stabilize → mitigate → recover → close), and operating cadence.
  • Build and maintain governance artifacts: runbooks, war room formats, reporting templates, and decision/communication standards.
  • Create and operationalize notification trees, stakeholder comms templates (initial, periodic updates, recovery/closure), and executive escalation criteria.
  • Define clear RACI across Facilities, Hardware Ops, Network, Security, and vendor/partner teams, including handoffs and accountability paths.
  • Set and manage SLAs/OLAs for acknowledgment, escalation, containment, mitigation, and reporting.
  • Implement and run incident management tooling (ticketing, paging, logging) and ensure integrations with monitoring and workflow systems.
  • Establish dashboards and program health metrics to track incident performance and readiness.
  • Lead readiness activities: tabletop exercises, cross-functional simulations, IC/Deputy training, and a rotating on-call IC bench with certification standards.
  • Serve as Incident Commander as needed: declare severity, stand up the war room, assign functional leads, and drive structured execution under pressure.
  • Maintain real-time documentation (decisions, timelines, impact scope) and ensure clear restoration objectives and scope control during active events.
  • Run post-incident reviews (PIRs), validate timelines, drive structured RCA (e.g., 5 Whys, Fault Tree), and distinguish root cause from contributing factors.
  • Define corrective/preventative actions (CAPAs), assign accountable owners, track to verified closure, and escalate overdue actions.
  • Publish trend reporting (incident taxonomy, counts by severity, MTTA/MTTR, repeat failure domains) and feed systemic gaps back into design and operations teams.

Requirements

  • 7+ years in mission-critical infrastructure, data center operations, or reliability engineering
  • Direct experience leading major incidents (P1/P0 equivalent)
  • Strong familiarity with facilities systems, hardware operations, or network infrastructure
  • Demonstrated experience running war rooms and executive updates
  • Experience conducting root cause analysis and corrective action tracking
  • Ability to remain calm and decisive under high-pressure conditions

Benefits

  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays and multiple coordinated, paid company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Relocation support for eligible employees
  • Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
incident management, root cause analysis, corrective action tracking, incident response standards, SLAs, OLAs, governance artifacts, incident taxonomy, MTTA, MTTR
Soft Skills
calm under pressure, decisive, leadership, communication, cross-functional collaboration, accountability, structured execution, training, problem-solving, reporting