Director of Customer Reliability Engineering

Nexus Cognitive

The Director of Customer Reliability Engineering at Nexus is responsible for building a customer reliability function from scratch, leading support operations and engineering teams in a B2B SaaS environment.

Posted 6/11/2026 • Full-time • Atlanta • 🇺🇸 United States • Lead

Applicant Tracking System Keywords

Hard Skills

customer reliability engineering, production support, tiered support model, platform upgrades, automation, Kubernetes, Spark, Trino, Airflow, observability

Soft Skills

leadership, team development, communication, written clarity, operational discipline, builder mindset, high agency, escalation management, incident management, capacity planning

Tools & Technologies

OpenTelemetry, Prometheus, Grafana, PagerDuty, incident.io, structured logging, distributed tracing, ticketing systems, knowledge management, runbook infrastructure

Industry Keywords

B2B SaaS, data platform, 24x7 support, SLA, SLO, AI-native, global team management, production environments, rollback discipline, customer escalations

Tech Stack

Tools & technologies

Airflow, Grafana, Kubernetes, Prometheus, Spark

About the role

Key responsibilities & impact

The Director of Customer Reliability Engineering owns NX1's customer reliability function end-to-end.
Design and build the L1–L3 support model from scratch — tier definitions, escalation matrix, on-call rotation, SLA/SLO framework, and the AI-native layer that makes it scale
Own 24x7 production support with hard SLAs across a growing enterprise customer base
Build the knowledge management and runbook infrastructure that support engineers and AI agents both rely on
Choose and configure the support tooling stack (ticketing, incident management, observability, status page) as deliberate design choices, not defaults
Drive operational comms during incidents: P1 cadence, postmortems customers can read, written substance for executive escalations
Provide credible L1–L3 support across both stacks.
Design support tiers, on-call rotations, and capacity models that handle both stacks within a single operational frame
Build the production operations function from scratch for the customer environments NX1 operates (the Cloudera-customer book today, NX1-operated customers as they come online)
Own platform upgrades end-to-end: scheduling, pre-upgrade testing in dev/staging, production execution, rollback discipline, change-management hygiene
Own the monitoring and alerting infrastructure: what gets measured, what triggers an alert, what's automated, what escalates to a human, what's documented as a runbook vs. a one-pager exception
Build the automation backbone: runbook execution automation, deploy/upgrade tooling, alert routing, escalation logic, automated remediation for known failure patterns.
Hire and develop a team of support engineers and managed service engineers spanning US and India
Build sustainable on-call practices — real geo coverage, comp-time discipline, escalation hygiene
Protect the team from pre-sales pull; build the handoff model with sales and customer success that keeps support engineers focused

Requirements

What you’ll need

8–12 years of experience, including 4+ years in customer reliability, support, OR production engineering leadership
Has built a tiered support and/or production-engineering organization from 0-to-1 at a B2B SaaS or data platform company — ideally as a senior IC or first-time leader, not as a veteran of a 50-person org
Has run a global team of at least 15 people across at least two geographies (we want a builder coming up the curve)
Has personally led platform upgrade rollouts on customer-facing production environments and managed at least one rollback — upgrade discipline is a hands-on muscle, not a process muscle
Owned 24x7 production support with hard SLAs and live enterprise customer escalations
Recent operational experience (last 3 years) with Kubernetes-native, modern open-source data infrastructure as the primary stack — Spark, Trino, Airflow, Iceberg, object storage, containerized orchestration
Hands-on with modern observability and on-call tooling: OpenTelemetry, Prometheus, Grafana, PagerDuty / incident.io, structured logging, distributed tracing
Has actually shipped AI or automation into a support OR production-ops workflow with measurable outcomes — not 'evaluating' or 'exploring'
High agency, builder over maintainer, founding mindset. Will write the playbook AND the early automation from a blank page
Bias toward written clarity. Runbooks, postmortems, status updates, exec briefs live in writing
Atlanta-based or genuinely willing to relocate

Benefits

Comp & perks

A collaborative team culture built on curiosity and respect
Challenging work where your contributions clearly matter
A leadership team that invests in learning and development
The opportunity to work at the intersection of cloud, data, and AI innovation