Tech Stack
Tools & technologies
- AWS
- Cloud
- Distributed Systems
About the role
Key responsibilities & impact
- Own the incident lifecycle: detection, triage, escalation, resolution, and postmortems
- Act as the central command during major incidents (war rooms, stakeholder updates)
- Define and enforce SLAs/SLOs, incident severity frameworks, and runbooks
- Collaborate with Engineering, ML, and Integrations teams to resolve issues quickly
- Monitor system health across integrations (agent desks, LLMs, ASR/TTS pipelines)
- Drive root cause analysis (RCA) and preventive actions
- Improve observability, alerting, and incident tooling
- Maintain clear internal and customer-facing communication during incidents
Requirements
What you’ll need
- 3–6 years in Incident Management / SRE / Production Support roles
- Strong understanding of distributed systems, APIs, and cloud environments (AWS)
- Experience with observability tools (e.g., DataDog)
- Familiarity with AI/ML systems, especially LLM integrations and voice stacks (ASR/TTS), is a plus
- Experience with monitoring/tracing tools like Langfuse or similar
- Excellent communication and stakeholder management skills
- Ability to stay calm under pressure and drive structured resolution
Benefits
Comp & perks
- Equal opportunity employer committed to diversity
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
incident management, site reliability engineering, production support, distributed systems, APIs, cloud environments, observability tools, monitoring tools, root cause analysis, incident tooling
Soft Skills
communication, stakeholder management, calm under pressure, structured resolution
