Define and own what “good” means for search-augmented and agentic AI systems by designing evaluation frameworks that measure real-world quality, reliability, and user-relevant behavior beyond standard benchmarks.
Invent and validate novel evaluation methodologies for non-deterministic systems (LLMs, agents, RAG), including behavioral evals, long-tail and adversarial test sets, and task-specific metrics.
Develop rigorous statistical frameworks for model comparison, regression detection, and uncertainty estimation, ensuring evaluation results are defensible and decision-ready.
Build and maintain scalable evaluation systems—datasets, gold standards, eval harnesses, scoring pipelines, and analysis tooling—that can be reused across products and customers.
Lead customer-facing evaluation research, working directly with enterprise customers to translate domain-specific quality requirements into credible, actionable evals that support product decisions and sales outcomes.
Drive competitive evaluations and internal quality reviews, surfacing meaningful performance differences, trade-offs, and failure modes to inform product strategy and prioritization.
Partner with engineering and product teams to integrate evals into development loops, release gating, and ongoing quality monitoring.
Mentor and set standards for evaluation practice, reviewing eval designs, guiding other scientists, and shaping the long-term evals roadmap as systems become more agentic and complex.
End-to-End Project Leadership: Lead the development of new AI-driven projects, encompassing ideation, prototyping, research, infrastructure design, scalability, monitoring, and evaluation.
Rapid Iteration: Adapt quickly to user feedback and evolving requirements, ensuring continuous improvement in a fast-paced environment.

Requirements

Strong grounding in applied ML and statistics, with experience evaluating non-deterministic AI systems (LLMs, agents, RAG, search).
Deep experience with AI evaluation, including metric design, gold dataset creation, head-to-head comparisons, slicing, and error analysis.
Statistical rigor in model comparison, using methods such as paired tests, bootstrap confidence intervals, and robustness analyses.
Proficiency in Python for evaluation and analysis, including building eval harnesses, data pipelines, scoring logic, and reproducible analysis workflows.
Ability to translate vague product or customer goals into measurable evaluation criteria, and to challenge metrics or conclusions that don’t reflect real quality.
Comfort engaging directly with customers and cross-functional stakeholders, explaining evaluation results, trade-offs, and limitations clearly.
Strong written and verbal communication, including documenting methodologies and contributing to external publications or talks.

Benefits

Hubs in San Francisco and New York City offering regular in-person gatherings and co-working sessions
Flexible PTO with U.S. holidays observed and a week-long shutdown in December to rest and recharge*
A competitive health insurance plan with 100% coverage for the policyholder and 75% for dependents*
12 weeks of paid parental leave in the U.S.*
401(k) program with a 3% match, vested immediately!*
$500 work-from-home stipend to be used within a year of your start date*
$1,200 per year Health & Wellness Allowance to support your personal goals*
The chance to collaborate with a team at the forefront of AI research

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

applied ML, statistics, AI evaluation, metric design, gold dataset creation, model comparison, Python, data pipelines, scoring logic, error analysis

Soft Skills

communication, customer engagement, cross-functional collaboration, mentoring, project leadership, adaptability, problem-solving, translating goals, continuous improvement, documentation