
Senior Incident Manager
Databricks
full-time
Location Type: Remote
Location: Remote • California • 🇺🇸 United States
Salary
💰 $143,300 - $200,600 per year
Job Level
Senior
Tech Stack
AWS, Azure, Cloud, Distributed Systems, Elasticsearch, Go, Google Cloud Platform, Grafana, Prometheus, Python, Splunk
About the role
- Lead critical incidents — coordinate multi-disciplinary response efforts across Databricks’ cloud-based services to rapidly mitigate impact and restore operations.
- Drive technical root cause analysis and reliability improvements:
  - Collaborate with engineering teams to trace and document underlying causes across distributed systems, services, and data stores.
  - Summarize key learnings, clearly communicate action items, and ensure that technical and procedural improvements are followed through.
- Own communications during incidents — deliver frequent, high-quality updates to internal stakeholders (executives, engineering leadership, support) and compose and publish customer-facing notifications that are accurate, timely, and empathetic.
- Mentor and train peers in both incident communication and technical response disciplines to raise the overall quality of Databricks’ incident response.
Requirements
- 5+ years of experience in incident management, site reliability engineering, or production operations supporting large-scale, cloud-native systems.
- Proven ability to lead and coordinate high-severity incidents, including identifying impact, isolating fault domains, and managing multi-team response efforts.
- Strong understanding of cloud infrastructure (AWS, Azure, or GCP) — including compute, networking, storage, and observability components.
- Deep expertise in log analysis and debugging:
  - Familiarity with log aggregation and search tools (e.g., Datadog, Elasticsearch, Splunk, Cloud Logging, or OpenTelemetry).
  - Hands-on experience with observability systems: metrics, logging, and tracing frameworks (Prometheus, Grafana, OpenTelemetry, etc.).
- Proficiency in at least one major programming or scripting language (Python, Go, or Bash) for automating diagnostics, data collection, or analysis.
- Experience developing and maintaining incident playbooks and communication templates to ensure consistent, timely updates.
- Excellent contextual interpretation and writing skills, with the ability to summarize and communicate effectively to both technical and business audiences.
- BS, Master's, or other advanced degree in Computer Science, Computer Engineering, or a related engineering field.
Benefits
- At Databricks, we strive to provide comprehensive benefits and perks that meet the needs of all of our employees.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
incident management, site reliability engineering, cloud-native systems, log analysis, debugging, programming, scripting, observability systems, incident playbooks, communication templates
Soft skills
leadership, communication, mentoring, collaboration, contextual interpretation, writing, summarization