
Staff Software Engineer I – SRE
Confluent
full-time
Location Type: Remote
Location: India
About the role
- Analyze systemic failure patterns and design improvements that prevent incident recurrence
- Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
- Build tooling and automation to reduce incident response toil and scale team impact
- Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
- Analyze reliability data to identify systemic improvements; build dashboards that drive action
- Explore AI-assisted approaches to documentation quality and incident analysis
- Design scalable reliability standards that reduce reactive workload over time
- Own standards, practices, and continuous improvement of incident response
- Define incident commander eligibility criteria and manage the rotation
- Serve as escalation incident commander (IC) when incidents exceed a team's management chain
- Develop and deliver training programs for engineering teams at all levels
- Coach teams through post-mortems and help them develop actionable corrective actions
- Edit and review customer-facing incident documents to ensure quality and clarity
- Drive document turnaround SLAs while maintaining technical accuracy
- Ensure each document clearly explains what happened, why, and how we'll prevent recurrence
- Partner with engineering leaders to elevate reliability practices
- Be the expert who teams proactively engage for guidance
Requirements
- 10+ years in SRE, incident management, or reliability engineering
- Cloud experience with at least one of AWS, GCP, or Azure
- Deep expertise with incident management tooling (Rootly, PagerDuty, or similar platforms)
- Strong understanding of distributed systems and failure modes at scale; Kafka/event streaming expertise preferred, or demonstrated rapid mastery of complex systems
- Deep experience with observability (metrics, logging, tracing) and the ability to diagnose complex issues
- Kubernetes and container orchestration experience
- Understanding of CI/CD pipelines and release processes
- Systems thinking: understanding how infrastructure design choices affect failure modes and recovery
- Familiarity with SLO/SLA frameworks
- Track record as a trusted advisor across engineering organizations
- Experience driving org-wide process and cultural changes
- Strong written communication (design docs, one-pagers, runbooks)
- Post-mortem facilitation experience
- Experience with async collaboration across time zones
- Experience navigating reliability/incident programs at large companies (500+ engineers)
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
SRE, incident management, reliability engineering, cloud computing, AWS, GCP, Azure, Kubernetes, CI/CD, observability
Soft Skills
coaching, communication, process improvement, collaboration, trusted advisor, facilitation, training development, systems thinking, incident response, post-mortem analysis