Leads the technical evaluation and assurance efforts within our AI Governance team
Establishes enterprise-grade, decision-relevant methodologies for red teaming, testing, and evaluating AI systems across traditional ML, Generative AI, and Agentic AI applications
Develops reproducible frameworks to measure AI value, user impact, and broader outcomes
Designs rigorous evaluation methodologies for assessing AI system performance, safety, reliability, and alignment with intended use across the AI lifecycle
Develops criteria and benchmarks to determine whether existing evaluations are adequate and sufficient for different AI applications and risk profiles
Designs and executes comprehensive red team exercises to identify vulnerabilities, failure modes, and unintended behaviors across diverse AI systems
Establishes standards for evaluation coverage, rigor, and documentation across the AI lifecycle
Advances the scientific understanding of AI evaluation and safety through white papers and trainings
Provides technical leadership and mentorship to scientists, engineers, and compliance professionals while building organizational evaluation capabilities
Stays at the forefront of AI safety research and identifies novel risks emerging from advanced AI capabilities
Translates complex technical findings into actionable recommendations for leadership, governance boards, and cross-functional teams
Collaborates with external researchers, institutions, and industry partners to advance evaluation methodology

Requirements

Bachelor's Degree in Computer Science, Machine Learning, Statistics, or a related quantitative field, or equivalent experience, required
Master's Degree preferred
6+ years of AI/ML research, including 3+ years focused on model evaluation, safety, or robustness, required
8+ years preferred
Deep technical expertise in modern AI systems with hands-on experience evaluating large language models, generative AI, and/or agentic systems required
Proven track record of designing rigorous evaluation methodologies, along with a publication record, required
Strong foundation in statistical methods, experimental design, causal inference, and excellent Python programming skills with ML frameworks required
Familiarity with a Python-based AI/ML stack using PyTorch and Databricks, and with agentic AI frameworks (LangChain, LlamaIndex, LangGraph, AutoGen, CrewAI) for single- and multi-agent systems
Strong focus on LLM observability, MLOps, and evaluation using LangSmith, MLflow, Weights & Biases, Datadog, OpenTelemetry, and testing frameworks like DeepEval and LangTest

Benefits

competitive pay
health insurance
401K and stock purchase plans
tuition reimbursement
paid time off plus holidays
a flexible approach to work with remote, hybrid, field or office work schedules

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

AI evaluation methodologies, model evaluation, safety, robustness, statistical methods, experimental design, causal inference, Python programming, large language models, generative AI

Soft Skills

technical leadership, mentorship, collaboration, communication, organizational capabilities