7SG

Senior Director – AI Production Reliability & Trust

full-time

Location Type: Remote

Location: United States

About the role

  • Production reliability: ensuring HAP (our Hybrid AI Platform) and customer deployments operate dependably under real traffic, real data, and real regulatory constraints
  • Agent trust: building the frameworks — technical and operational — that allow enterprise customers to trust autonomous AI agents doing work on their behalf
  • A production quality operating system: quality gates, phase transition criteria, incident taxonomy, observability spec across our 6-layer Reference Architecture
  • A continuous validation framework for agentic workflows — autonomous evaluation pipelines that catch regression without human intervention
  • An agent decision qualification framework: risk-tiered oversight for autonomous agent decisions
  • A trust evidence system: the observable signals that enterprise customers use to extend trust to agents operating on their behalf
  • Production observability: instrumentation across Ingest, Prepare, Serve, Orchestrate, Monitor, and Optimize layers of the Reference Architecture
  • A post-mortem and CAPA system: every production incident produces a root cause, a corrective action, and a new test that prevents recurrence

Requirements

  • 10+ years in quality, reliability, or production operations for complex distributed systems — with at least some of that time governing AI or ML systems in live production
  • Direct implementation experience with AI quality frameworks — you built it, not just led a team that built it
  • Familiarity with the agentic AI quality problem: non-deterministic systems, hallucination detection, behavioral drift, autonomous decision governance
  • Working knowledge of open-source evaluation and observability frameworks (LangSmith, Arize/Phoenix, RAGAS, PromptFlow, Weights & Biases, or similar) — not just commercial alternatives
  • Background in regulated industries (financial services, telecom, healthcare, government) where AI quality failures have real contractual and commercial consequences
  • Startup orientation: comfortable with ambiguity, iterative scope, and a team that moves faster than most people expect

Applicant Tracking System Keywords

Hard Skills & Tools
production reliability, AI quality frameworks, autonomous evaluation pipelines, incident taxonomy, risk-tiered oversight, observability spec, continuous validation framework, non-deterministic systems, hallucination detection, behavioral drift
Soft Skills
trust building, operational frameworks, incident management, root cause analysis, corrective action planning