Liquid Web

Reliability Operations Manager

full-time

Location Type: Remote

Location: Missouri, United States

Salary

💰 $110,000 - $150,000 per year

About the role

  • Own Reliability Operations & Incident Command
  • Continuously evolve and improve incident management, change management, and post-incident practices
  • Establish clear standards for incident declaration, severity, escalation, and communication
  • Ensure consistent execution across teams and continuous process improvement
  • Own the incident command function, including roles, structure, and operating procedures
  • Lead or oversee major incident response in a 24/7 production environment
  • Build and manage on-call incident commander rotations with global coverage
  • Own post-incident reviews, ensuring strong root cause analysis and clear documentation
  • Translate incident trends into actionable reliability improvements
  • Drive completion of corrective actions across teams; escalate when needed
  • Define and maintain service performance and reliability targets (availability, latency, error rates)
  • Own observability strategy, including monitoring, alerting, and signal quality
  • Improve detection, reduce time to resolution, and increase platform resilience
  • Partner with Engineering and Operations on capacity planning, patching, and lifecycle decisions
  • Ensure reliability insights directly inform platform and infrastructure roadmaps
  • Collaborate with Security on vulnerability response, patch prioritization, and compliance alignment
  • Work across environments including virtualization platforms (VMware), distributed storage (Ceph), Linux-based systems, and hybrid cloud infrastructure
  • Provide regular, data-driven reporting to leadership on availability, incident trends, and operational performance
  • Act as the central authority on reliability insights across teams

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
  • 7+ years of experience in systems operations, site reliability, or platform engineering
  • 2+ years of experience leading teams or major operational functions
  • Proven experience managing incidents in a 24/7 production environment
  • Strong background in troubleshooting, root cause analysis, and operational improvement
  • Experience with change management practices
  • Experience with monitoring and observability platforms (e.g., Datadog, Prometheus, Grafana, New Relic)
  • Experience with incident management and alerting tools (e.g., PagerDuty, Opsgenie)
  • Experience with infrastructure and platform technologies (Linux systems, VMware, Ceph, cloud platforms)
  • Experience with logging and telemetry systems (centralized logging, metrics, tracing)
  • Ability to translate complex technical data into clear insights
  • Strong communication skills, especially in high-pressure situations

Benefits

  • Traditional and Roth 401k with company matching
  • A collaborative team culture
  • Consistent/set work hours
  • Challenging, non-redundant daily duties
  • A voice in how things get done

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
incident management, change management, root cause analysis, troubleshooting, operational improvement, service performance targets, reliability improvements, data-driven reporting, capacity planning, platform resilience
Soft Skills
leadership, communication, collaboration, problem-solving, adaptability, critical thinking, decision-making, time management, interpersonal skills, high-pressure situation management
Certifications
Bachelor’s degree in Computer Science, Bachelor’s degree in Engineering, related field degree, equivalent practical experience