Senior Incident Manager

Lambda

The Senior Incident Manager oversees critical incident management for AI cloud infrastructure, ensuring rapid resolution and operational resilience across data center operations and engineering teams.

Posted 6/4/2026 • Full-time • Remote • California • 🇺🇸 United States • Senior • 💰 $125,000 - $195,000 per year

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills

incident management, site reliability engineering, infrastructure operations, data center operations, GPU compute clusters, networking infrastructure, storage infrastructure, cloud infrastructure, hybrid infrastructure, incident management frameworks

Soft Skills

leadership, communication, stakeholder management, coordination, analysis, problem-solving, team collaboration, adaptability, decision-making, time management

Tools & Technologies

PagerDuty, ServiceNow, Jira, Datadog, Prometheus, Grafana

Industry Keywords

SEV-1, SEV-2, incident response lifecycle, post-incident reviews, root cause analysis, system reliability, operational playbooks, technical triage, escalation, outages

Tech Stack

Tools & technologies

Cloud, Grafana, Prometheus, ServiceNow

About the role

Key responsibilities & impact

Lead the response to critical (SEV-1 / SEV-2) incidents impacting AI infrastructure, GPU clusters, networking, storage, and data center operations.
Serve as the Incident Commander during major outages, coordinating engineering, networking, facilities, and vendor teams.
Act as the liaison between leadership and external teams during and after incidents, providing updates and status summaries.
Own the incident response lifecycle, including:
- Assisting with technical triage
- Escalation
- Coordination
- Resolution
Ensure timely and accurate communication with internal stakeholders and leadership.
Maintain incident response documentation and operational playbooks.
Analyze incidents and identify patterns and trends to improve response and system reliability.
Participate in an on-call rotation to respond to, lead, and coordinate incidents.
Drive alignment during outages involving multiple infrastructure layers.
Lead post-incident reviews (PIRs) and root cause analysis. Identify systemic reliability gaps and implement corrective actions.

Requirements

What you’ll need

8+ years of experience in incident management, site reliability engineering, or infrastructure operations
Experience managing incidents in large-scale distributed infrastructure environments
Strong understanding of:
- Data center operations
- GPU compute clusters
- Networking and storage infrastructure
- Cloud or hybrid infrastructure platforms
Proven ability to lead high-pressure incident response situations
Experience with incident management frameworks (ITIL, SRE, or equivalent)
Excellent communication and stakeholder management skills
Experience with incident tracking and monitoring tools such as:
- PagerDuty
- ServiceNow
- Jira
- Datadog
- Prometheus / Grafana

Benefits

Comp & perks

Health, dental, and vision coverage for you and your dependents
Wellness and commuter stipends for select roles
401(k) plan with 2% company match (USA employees)
Flexible paid time off plan that we all actually use