Senior Incident Manager

Databricks

Full-time

Posted on: 11/10/2025

Location Type: Remote

Location: Remote • California • 🇺🇸 United States

Salary

💰 $143,300 - $200,600 per year

Job Level

Senior

Tech Stack

AWS, Azure, Cloud, Distributed Systems, Elasticsearch, Go, Google Cloud Platform, Grafana, Prometheus, Python, Splunk

About the role

Lead critical incidents — coordinate multi-disciplinary response efforts across Databricks’ cloud-based services to rapidly mitigate impact and restore operations.
Drive technical root cause analysis and reliability improvements: collaborate with engineering teams to trace and document underlying causes across distributed systems, services, and data stores.
Summarize key learnings, clearly communicate action items, and ensure that technical and procedural improvements are followed through.
Own communications during incidents — deliver frequent, high-quality updates to internal stakeholders (executives, engineering leadership, support) and compose and publish customer-facing notifications that are accurate, timely, and empathetic.
Mentor and train peers in both incident communication and technical response disciplines to raise the overall quality of Databricks’ incident response.

Requirements

5+ years of experience in incident management, site reliability engineering, or production operations supporting large-scale, cloud-native systems.
Proven ability to lead and coordinate high-severity incidents, including identifying impact, isolating fault domains, and managing multi-team response efforts.
Strong understanding of cloud infrastructure (AWS, Azure, or GCP) — including compute, networking, storage, and observability components.
Deep expertise in log analysis and debugging, including familiarity with log aggregation and search tools (e.g., Datadog, Elasticsearch, Splunk, Cloud Logging, or OpenTelemetry).
Hands-on experience with observability systems — metrics, logging, and tracing frameworks (Prometheus, Grafana, OpenTelemetry, etc.).
Proficiency in at least one major programming or scripting language (Python, Go, or Bash) for automating diagnostics, data collection, or analysis.
Experience developing and maintaining incident playbooks and communication templates to ensure consistent, timely updates.
Excellent contextual interpretation and writing skills, plus the ability to summarize and communicate effectively to both technical and business audiences.
Bachelor's, Master's, or other advanced degree in Computer Science, Computer Engineering, or a related engineering field.

Benefits

At Databricks, we strive to provide comprehensive benefits and perks that meet the needs of all of our employees.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills

incident management, site reliability engineering, cloud-native systems, log analysis, debugging, programming, scripting, observability systems, incident playbooks, communication templates

Soft skills

leadership, communication, mentoring, collaboration, contextual interpretation, writing, summarization