Incident Engineer

Netomi

Incident Manager leading end-to-end incident response for the AI and platform stack at Netomi. Ensures robust resolution and clear communication during incidents impacting customers and internal systems.

Posted 4/10/2026 • Full-time • Remote • 🇮🇳 India • Mid-Level, Senior

Tech Stack

Tools & technologies

AWS, Cloud, Distributed Systems

About the role

Key responsibilities & impact

Own the incident lifecycle: detection, triage, escalation, resolution, and postmortems
Act as central command during major incidents (war rooms, stakeholder updates)
Define and enforce SLAs/SLOs, incident severity frameworks, and runbooks
Collaborate with Engineering, ML, and Integrations teams to resolve issues quickly
Monitor system health across integrations (agent desks, LLMs, ASR/TTS pipelines)
Drive root cause analysis (RCA) and preventive actions
Improve observability, alerting, and incident tooling
Maintain clear internal and customer-facing communication during incidents

Requirements

What you’ll need

3–6 years in Incident Management / SRE / Production Support roles
Strong understanding of distributed systems, APIs, and cloud environments (AWS)
Experience with observability tools (e.g., Datadog)
Familiarity with AI/ML systems, especially LLM integrations and voice stacks (ASR/TTS), is a plus
Experience with monitoring/tracing tools such as Langfuse
Excellent communication and stakeholder management skills
Ability to stay calm under pressure and drive structured resolution

Benefits

Comp & perks

Equal opportunity employer committed to diversity

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

incident management, site reliability engineering, production support, distributed systems, APIs, cloud environments, observability tools, monitoring tools, root cause analysis, incident tooling

Soft Skills

communication, stakeholder management, calm under pressure, structured resolution