Nexus Cognitive

Director of Customer Reliability Engineering

The Director of Customer Reliability Engineering at Nexus is responsible for building a customer reliability function from scratch and for leading support operations and engineering teams in a B2B SaaS environment.

Posted 6/12/2026 • Full-time • Atlanta, United States • Lead

Tech Stack

Tools & technologies
Airflow, Grafana, Kubernetes, Prometheus, Spark

About the role

Key responsibilities & impact
  • The Director of Customer Reliability Engineering owns NX1's customer reliability function end-to-end.
  • Design and build the L1–L3 support model from scratch — tier definitions, escalation matrix, on-call rotation, SLA/SLO framework, and the AI-native layer that makes it scale
  • Own 24x7 production support with hard SLAs across a growing enterprise customer base
  • Build the knowledge management and runbook infrastructure that support engineers and AI agents both rely on
  • Choose and configure the support tooling stack (ticketing, incident management, observability, status page) as deliberate design choices, not defaults
  • Drive operational comms during incidents: P1 cadence, postmortems customers can read, written substance for executive escalations
  • Provide credible L1–L3 support across both stacks.
  • Design support tiers, on-call rotations, and capacity models that handle both stacks within a single operational frame
  • Build the production operations function from scratch for the customer environments NX1 operates (the Cloudera-customer book today, NX1-operated customers as they come online)
  • Own platform upgrades end-to-end: scheduling, pre-upgrade testing in dev/staging, production execution, rollback discipline, change-management hygiene
  • Own the monitoring and alerting infrastructure: what gets measured, what triggers an alert, what's automated, what escalates to a human, what's documented as a runbook vs. a one-pager exception
  • Build the automation backbone: runbook execution automation, deploy/upgrade tooling, alert routing, escalation logic, automated remediation for known failure patterns.
  • Hire and develop a team of support engineers and managed service engineers spanning US and India
  • Build sustainable on-call practices — real geo coverage, comp-time discipline, escalation hygiene
  • Protect the team from pre-sales pull; build the handoff model with sales and customer success that keeps support engineers focused

Requirements

What you’ll need
  • 8–12 years of experience; 4+ years in customer reliability, support, OR production engineering leadership
  • Has built a tiered support and/or production-engineering organization from 0-to-1 at a B2B SaaS or data platform company — ideally as a senior IC or first-time leader, not as a veteran of a 50-person org
  • Has run a global team of at least 15 people across at least two geographies (we want a builder coming up the curve)
  • Has personally led platform upgrade rollouts on customer-facing production environments and managed at least one rollback — upgrade discipline is a hands-on muscle, not a process muscle
  • Owned 24x7 production support with hard SLAs and live enterprise customer escalations
  • Recent operational experience (last 3 years) with Kubernetes-native, modern open-source data infrastructure as the primary stack — Spark, Trino, Airflow, Iceberg, object storage, containerized orchestration
  • Hands-on with modern observability and on-call tooling: OpenTelemetry, Prometheus, Grafana, PagerDuty / incident.io, structured logging, distributed tracing
  • Has actually shipped AI or automation into a support OR production-ops workflow with measurable outcomes — not 'evaluating' or 'exploring'
  • High agency, builder over maintainer, founding mindset. Will write the playbook AND the early automation from a blank page
  • Bias toward written clarity. Runbooks, postmortems, status updates, exec briefs live in writing
  • Atlanta-based or genuinely willing to relocate

Benefits

Comp & perks
  • A collaborative team culture built on curiosity and respect
  • Challenging work where your contributions clearly matter
  • A leadership team that invests in learning and development
  • The opportunity to work at the intersection of cloud, data, and AI innovation
