Own Reliability Operations & Incident Command
Continuously evolve and improve incident management, change management, and post-incident practices
Establish clear standards for incident declaration, severity, escalation, and communication
Ensure consistent execution across teams and continuous process improvement
Own the incident command function, including roles, structure, and operating procedures
Lead or oversee major incident response in a 24/7 production environment
Build and manage on-call incident commander rotations with global coverage
Own post-incident reviews, ensuring strong root cause analysis and clear documentation
Translate incident trends into actionable reliability improvements
Drive completion of corrective actions across teams; escalate when needed
Define and maintain service performance and reliability targets (availability, latency, error rates)
Own observability strategy, including monitoring, alerting, and signal quality
Improve detection, reduce time to resolution, and increase platform resilience
Partner with Engineering and Operations on capacity planning, patching, and lifecycle decisions
Ensure reliability insights directly inform platform and infrastructure roadmaps
Collaborate with Security on vulnerability response, patch prioritization, and compliance alignment
Work across environments including virtualization platforms (VMware), distributed storage (Ceph), Linux-based systems, and hybrid cloud infrastructure
Provide regular, data-driven reporting to leadership on availability, incident trends, and operational performance
Act as the central authority on reliability insights across teams

Requirements

Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
7+ years of experience in systems operations, site reliability, or platform engineering
2+ years of experience leading teams or major operational functions
Proven experience managing incidents in a 24/7 production environment
Strong background in troubleshooting, root cause analysis, and operational improvement
Experience with change management practices
Experience with monitoring and observability platforms (e.g., Datadog, Prometheus, Grafana, New Relic)
Experience with incident management and alerting tools (e.g., PagerDuty, Opsgenie)
Experience with infrastructure and platform technologies (Linux systems, VMware, Ceph, cloud platforms)
Experience with logging and telemetry systems (centralized logging, metrics, tracing)
Ability to translate complex technical data into clear insights
Strong communication skills, especially in high-pressure situations

Benefits

Traditional and Roth 401(k) with company matching
A collaborative team culture
Consistent/set work hours
Challenging, varied daily duties
A voice in how things get done
