Staff Software Engineer I – SRE

Confluent

Full-time

Posted on: 1/28/2026

Location Type: Remote

Location: India

Job Level

Lead

Tech Stack

AWS, Azure, Cloud, Distributed Systems, Google Cloud Platform, Kafka, Kubernetes

About the role

Analyze systemic failure patterns and design improvements that prevent incident recurrence
Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
Build tooling and automation to reduce incident response toil and scale team impact
Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
Analyze reliability data to identify systemic improvements; build dashboards that drive action
Explore AI-assisted approaches to documentation quality and incident analysis
Design scalable reliability standards that reduce reactive workload over time
Own standards, practices, and continuous improvement of incident response
Define incident commander eligibility criteria and manage the rotation
Serve as escalation incident commander (IC) when incidents escalate beyond a team's management chain
Develop and deliver training programs for engineering teams at all levels
Coach teams through post-mortems and the development of actionable corrective actions
Edit and review customer-facing incident documents to ensure quality and clarity
Drive turnaround SLAs while maintaining technical accuracy
Ensure clear explanations of what happened, why, and how we'll prevent recurrence
Partner with engineering leaders to elevate reliability practices
Be the expert whom teams proactively engage for guidance

Requirements

10+ years in SRE, incident management, or reliability engineering
Cloud experience with at least one of AWS, GCP, or Azure
Deep expertise with incident management tooling (Rootly, PagerDuty, or similar platforms)
Strong understanding of distributed systems and failure modes at scale—Kafka/event streaming expertise preferred, or demonstrated rapid mastery of complex systems
Deep experience with observability (metrics, logging, tracing) and the ability to diagnose complex issues
Kubernetes and container orchestration experience
Understanding of CI/CD pipelines and release processes
Systems thinking: understanding how infrastructure design choices affect failure modes and recovery
Familiarity with SLO/SLA frameworks
Track record as a trusted advisor across engineering organizations
Experience driving org-wide process and cultural changes
Strong written communication (design docs, one-pagers, runbooks)
Post-mortem facilitation experience
Experience with async collaboration across time zones
Large-company experience navigating reliability and incident programs at organizations with 500+ engineers

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

SRE, incident management, reliability engineering, cloud computing, AWS, GCP, Azure, Kubernetes, CI/CD, observability

Soft Skills

coaching, communication, process improvement, collaboration, trusted advisor, facilitation, training development, systems thinking, incident response, post-mortem analysis