Confluent

Staff Site Reliability Engineer – Incident Management & Reliability

full-time

Location Type: Remote

Location: Canada

Salary

💰 CA$225,100 - CA$264,500 per year

About the role

  • Analyze systemic failure patterns and design reliability improvements that prevent incident recurrence
  • Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
  • Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
  • Own standards, practices, and continuous improvement of incident response across engineering
  • Edit and review customer-facing incident documents (CRCAs) to ensure quality and clarity
  • Develop and deliver training programs; coach teams through post-mortems
  • Partner with engineering leaders to elevate reliability practices org-wide

Requirements

  • 10+ years of relevant experience in SRE, incident management, or reliability engineering
  • Cloud experience with at least one of AWS, GCP, or Azure (we run all three)
  • Experience navigating reliability/incident programs at organizations with 500+ engineers
  • Deep expertise with incident management tooling (Rootly, PagerDuty, or similar)
  • Strong understanding of distributed systems and failure modes at scale
  • Deep experience with observability: metrics, logging, tracing
  • Kubernetes and container orchestration experience
  • Understanding of CI/CD pipelines and release processes
  • Strong written communication (design docs, runbooks, post-mortems)
  • Experience driving org-wide process and cultural changes
  • Kafka/event streaming expertise preferred, or demonstrated rapid mastery of complex systems

Benefits

  • Offers equity

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

SRE, incident management, reliability engineering, cloud computing, AWS, GCP, Azure, Kubernetes, CI/CD, observability

Soft Skills

strong written communication, coaching, continuous improvement, process change, collaboration