
Reliability Operations Manager
Liquid Web
full-time
Location Type: Remote
Location: Missouri • United States
Salary
💰 $110,000 - $150,000 per year
About the role
- Own Reliability Operations & Incident Command
- Continuously evolve and improve incident management, change management, and post-incident practices
- Establish clear standards for incident declaration, severity, escalation, and communication
- Ensure consistent execution across teams and continuous process improvement
- Own the incident command function, including roles, structure, and operating procedures
- Lead or oversee major incident response in a 24/7 production environment
- Build and manage on-call incident commander rotations with global coverage
- Own post-incident reviews, ensuring strong root cause analysis and clear documentation
- Translate incident trends into actionable reliability improvements
- Drive completion of corrective actions across teams; escalate when needed
- Define and maintain service performance and reliability targets (availability, latency, error rates)
- Own observability strategy, including monitoring, alerting, and signal quality
- Improve detection, reduce time to resolution, and increase platform resilience
- Partner with Engineering and Operations on capacity planning, patching, and lifecycle decisions
- Ensure reliability insights directly inform platform and infrastructure roadmaps
- Collaborate with Security on vulnerability response, patch prioritization, and compliance alignment
- Work across environments including virtualization platforms (VMware), distributed storage (Ceph), Linux-based systems, and hybrid cloud infrastructure
- Provide regular, data-driven reporting to leadership on availability, incident trends, and operational performance
- Act as the central authority on reliability insights across teams
Requirements
- Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
- 7+ years of experience in systems operations, site reliability, or platform engineering
- 2+ years of experience leading teams or major operational functions
- Proven experience managing incidents in a 24/7 production environment
- Strong background in troubleshooting, root cause analysis, and operational improvement
- Experience with change management practices
- Experience with monitoring and observability platforms (e.g., Datadog, Prometheus, Grafana, New Relic)
- Experience with incident management and alerting tools (e.g., PagerDuty, Opsgenie)
- Experience with infrastructure and platform technologies (Linux systems, VMware, Ceph, cloud platforms)
- Experience with logging and telemetry systems (centralized logging, metrics, tracing)
- Ability to translate complex technical data into clear insights
- Strong communication skills, especially in high-pressure situations
Benefits
- Traditional and Roth 401k with company matching
- A collaborative team culture
- Consistent/set work hours
- Challenging, non-repetitive daily duties
- A voice in how things get done
Applicant Tracking System Keywords
Hard Skills & Tools
incident management, change management, root cause analysis, troubleshooting, operational improvement, service performance targets, reliability improvements, data-driven reporting, capacity planning, platform resilience
Soft Skills
leadership, communication, collaboration, problem-solving, adaptability, critical thinking, decision-making, time management, interpersonal skills, high-pressure situation management
Certifications
Bachelor’s degree in Computer Science, Bachelor’s degree in Engineering, related field degree, equivalent practical experience