Databricks

Senior Incident Manager

full-time

Location Type: Remote

Location: Remote • California • 🇺🇸 United States

Salary

💰 $143,300 - $200,600 per year

Job Level

Senior

Tech Stack

AWS, Azure, Cloud, Distributed Systems, Elasticsearch, Go, Google Cloud Platform, Grafana, Prometheus, Python, Splunk

About the role

  • Lead critical incidents — coordinate multi-disciplinary response efforts across Databricks’ cloud-based services to rapidly mitigate impact and restore operations.
  • Drive technical root cause analysis and reliability improvements: collaborate with engineering teams to trace and document underlying causes across distributed systems, services, and data stores.
  • Summarize key learnings, clearly communicate action items, and ensure that technical and procedural improvements are followed through.
  • Own communications during incidents — deliver frequent, high-quality updates to internal stakeholders (executives, engineering leadership, support) and compose and publish customer-facing notifications that are accurate, timely, and empathetic.
  • Mentor and train peers in both incident communication and technical response disciplines to raise the overall quality of Databricks’ incident response.

Requirements

  • 5+ years of experience in incident management, site reliability engineering, or production operations supporting large-scale, cloud-native systems.
  • Proven ability to lead and coordinate high-severity incidents, including identifying impact, isolating fault domains, and managing multi-team response efforts.
  • Strong understanding of cloud infrastructure (AWS, Azure, or GCP) — including compute, networking, storage, and observability components.
  • Deep expertise in log analysis and debugging, including familiarity with log aggregation and search tools (e.g., Datadog, Elasticsearch, Splunk, Cloud Logging, or OpenTelemetry).
  • Hands-on experience with observability systems — metrics, logging, and tracing frameworks (Prometheus, Grafana, OpenTelemetry, etc.).
  • Proficiency in at least one major programming or scripting language (Python, Go, or Bash) for automating diagnostics, data collection, or analysis.
  • Experience developing and maintaining incident playbooks and communication templates to ensure consistent, timely updates.
  • Excellent contextual interpretation and writing skills, with the ability to summarize and communicate effectively to both technical and business audiences.
  • BS, Master's, or other advanced degree in Computer Science, Computer Engineering, or a related engineering field.

Benefits

  • At Databricks, we strive to provide comprehensive benefits and perks that meet the needs of all of our employees.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
incident management, site reliability engineering, cloud-native systems, log analysis, debugging, programming, scripting, observability systems, incident playbooks, communication templates
Soft skills
leadership, communication, mentoring, collaboration, contextual interpretation, writing, summarization