Confluent

Staff Software Engineer I – SRE

full-time

Location Type: Remote

Location: India

About the role

  • Analyze systemic failure patterns and design improvements that prevent incident recurrence
  • Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
  • Build tooling and automation to reduce incident response toil and scale team impact
  • Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
  • Analyze reliability data to identify systemic improvements; build dashboards that drive action
  • Explore AI-assisted approaches to documentation quality and incident analysis
  • Design scalable reliability standards that reduce reactive workload over time
  • Own standards, practices, and continuous improvement of incident response
  • Define incident commander eligibility criteria and manage the rotation
  • Serve as escalation incident commander (IC) when incidents exceed a team's management chain
  • Develop and deliver training programs for engineering teams at all levels
  • Coach teams through post-mortems and the development of actionable corrective actions
  • Edit and review customer-facing incident documents to ensure quality and clarity
  • Drive turnaround SLAs while maintaining technical accuracy
  • Ensure each document clearly explains what happened, why, and how recurrence will be prevented
  • Partner with engineering leaders to elevate reliability practices
  • Be the expert teams proactively turn to for guidance

Requirements

  • 10+ years in SRE, incident management, or reliability engineering
  • Cloud experience with at least one of AWS, GCP, or Azure
  • Deep expertise with incident management tooling (Rootly, PagerDuty, or similar platforms)
  • Strong understanding of distributed systems and failure modes at scale; Kafka/event streaming expertise preferred, or demonstrated rapid mastery of complex systems
  • Deep experience with observability (metrics, logging, tracing) and the ability to diagnose complex issues
  • Kubernetes and container orchestration experience
  • Understanding of CI/CD pipelines and release processes
  • Systems thinking: understanding how infrastructure design choices affect failure modes and recovery
  • Familiarity with SLO/SLA frameworks
  • Track record as a trusted advisor across engineering organizations
  • Experience driving org-wide process and cultural changes
  • Strong written communication (design docs, one-pagers, runbooks)
  • Post-mortem facilitation experience
  • Experience with async collaboration across time zones
  • Large-company experience navigating reliability and incident programs at organizations with 500+ engineers
Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
SRE, incident management, reliability engineering, cloud computing, AWS, GCP, Azure, Kubernetes, CI/CD, observability
Soft Skills
coaching, communication, process improvement, collaboration, trusted advisor, facilitation, training development, systems thinking, incident response, post-mortem analysis