Senior Incident Manager
Lambda
Senior Incident Manager overseeing critical incident management for AI cloud infrastructure, ensuring rapid resolution and operational resilience across data center operations and engineering teams.
Posted 6/4/2026 • Full-time • Remote • California • 🇺🇸 United States • Senior • 💰 $125,000 - $195,000 per year
Tech Stack
Tools & technologies: Cloud, Grafana, Prometheus, ServiceNow
About the role
Key responsibilities & impact
- Lead the response to critical (SEV-1 / SEV-2) incidents impacting AI infrastructure, GPU clusters, networking, storage, and data center operations.
- Serve as the Incident Commander during major outages, coordinating engineering, networking, facilities, and vendor teams.
- Act as the liaison between leadership and external teams during and after incidents, providing updates and status summaries.
- Own the incident response lifecycle including:
  - Assisting technical triage
  - Escalation
  - Coordination
  - Resolution
- Ensure timely and accurate communication with internal stakeholders and leadership.
- Maintain incident response documentation and operational playbooks.
- Analyze incidents and identify patterns and trends to improve response and system reliability.
- Participate in an on-call rotation to respond to, lead, and coordinate incidents.
- Drive alignment during outages involving multiple infrastructure layers.
- Lead post-incident reviews (PIRs) and root cause analysis. Identify systemic reliability gaps and implement corrective actions.
Requirements
What you’ll need
- 8+ years of experience in incident management, site reliability engineering, or infrastructure operations
- Experience managing incidents in large-scale distributed infrastructure environments
- Strong understanding of:
  - Data center operations
  - GPU compute clusters
  - Networking and storage infrastructure
  - Cloud or hybrid infrastructure platforms
- Proven ability to lead high-pressure incident response situations
- Experience with incident management frameworks (ITIL, SRE, or equivalent)
- Excellent communication and stakeholder management skills
- Experience with incident tracking and monitoring tools such as:
  - PagerDuty
  - ServiceNow
  - Jira
  - Datadog
  - Prometheus / Grafana
Benefits
Comp & perks
- Health, dental, and vision coverage for you and your dependents
- Wellness and commuter stipends for select roles
- 401(k) plan with 2% company match (USA employees)
- Flexible paid time off plan that we all actually use
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
incident management, site reliability engineering, infrastructure operations, data center operations, GPU compute clusters, networking infrastructure, storage infrastructure, cloud infrastructure, hybrid infrastructure, incident management frameworks
Soft Skills
leadership, communication, stakeholder management, coordination, analysis, problem-solving, team collaboration, adaptability, decision-making, time management