Director of Customer Reliability Engineering
Nexus Cognitive
Director of Customer Reliability Engineering at Nexus, responsible for building a customer reliability function from scratch. Leads support operations and engineering teams in a B2B SaaS environment.
Tech Stack
Tools & technologies: Airflow, Grafana, Kubernetes, Prometheus, Spark
About the role
Key responsibilities & impact
- The Director of Customer Reliability Engineering owns NX1's customer reliability function end-to-end.
- Design and build the L1–L3 support model from scratch — tier definitions, escalation matrix, on-call rotation, SLA/SLO framework, and the AI-native layer that makes it scale
- Own 24x7 production support with hard SLAs across a growing enterprise customer base
- Build the knowledge management and runbook infrastructure that support engineers and AI agents both rely on
- Choose and configure the support tooling stack (ticketing, incident management, observability, status page) as deliberate design choices, not defaults
- Drive operational comms during incidents: P1 cadence, postmortems customers can read, written substance for executive escalations
- Provide credible L1–L3 support across both stacks.
- Design support tiers, on-call rotations, and capacity models that handle both stacks within a single operational frame
- Build the production operations function from scratch for the customer environments NX1 operates (the Cloudera-customer book today, NX1-operated customers as they come online)
- Own platform upgrades end-to-end: scheduling, pre-upgrade testing in dev/staging, production execution, rollback discipline, change-management hygiene
- Own the monitoring and alerting infrastructure: what gets measured, what triggers an alert, what's automated, what escalates to a human, what's documented as a runbook vs. a one-pager exception
- Build the automation backbone: runbook execution automation, deploy/upgrade tooling, alert routing, escalation logic, automated remediation for known failure patterns.
- Hire and develop a team of support engineers and managed service engineers spanning US and India
- Build sustainable on-call practices — real geo coverage, comp-time discipline, escalation hygiene
- Protect the team from pre-sales pull; build the handoff model with sales and customer success that keeps support engineers focused
Requirements
What you’ll need
- 8–12 years of experience; 4+ years in customer reliability, support, OR production engineering leadership
- Has built a tiered support and/or production-engineering organization from 0-to-1 at a B2B SaaS or data platform company — ideally as a senior IC or first-time leader, not as a veteran of a 50-person org
- Has run a global team of at least 15 people across at least two geographies (we want a builder coming up the curve)
- Has personally led platform upgrade rollouts on customer-facing production environments and managed at least one rollback — upgrade discipline is a hands-on muscle, not a process muscle
- Owned 24x7 production support with hard SLAs and live enterprise customer escalations
- Recent operational experience (last 3 years) with Kubernetes-native, modern open-source data infrastructure as the primary stack — Spark, Trino, Airflow, Iceberg, object storage, containerized orchestration
- Hands-on with modern observability and on-call tooling: OpenTelemetry, Prometheus, Grafana, PagerDuty / incident.io, structured logging, distributed tracing
- Has actually shipped AI or automation into a support OR production-ops workflow with measurable outcomes — not 'evaluating' or 'exploring'
- High agency, builder over maintainer, founding mindset. Will write the playbook AND the early automation from a blank page
- Bias toward written clarity. Runbooks, postmortems, status updates, exec briefs live in writing
- Atlanta-based or genuinely willing to relocate
Benefits
Comp & perks
- A collaborative team culture built on curiosity and respect
- Challenging work where your contributions clearly matter
- A leadership team that invests in learning and development
- The opportunity to work at the intersection of cloud, data, and AI innovation
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
customer reliability engineering, production support, tiered support model, platform upgrades, automation, Kubernetes, Spark, Trino, Airflow, observability
Soft Skills
leadership, team development, communication, written clarity, operational discipline, builder mindset, high agency, escalation management, incident management, capacity planning