
Senior AI Scientist
You.com
full-time
Location Type: Remote
Location: California • United States
Salary
💰 $160,000 - $200,000 per year
About the role
- Define and own what “good” means for search-augmented and agentic AI systems by designing evaluation frameworks that measure real-world quality, reliability, and user-relevant behavior beyond standard benchmarks.
- Invent and validate novel evaluation methodologies for non-deterministic systems (LLMs, agents, RAG), including behavioral evals, long-tail and adversarial test sets, and task-specific metrics.
- Develop rigorous statistical frameworks for model comparison, regression detection, and uncertainty estimation, ensuring evaluation results are defensible and decision-ready.
- Build and maintain scalable evaluation systems—datasets, gold standards, eval harnesses, scoring pipelines, and analysis tooling—that can be reused across products and customers.
- Lead customer-facing evaluation research, working directly with enterprise customers to translate domain-specific quality requirements into credible, actionable evals that support product decisions and sales outcomes.
- Drive competitive evaluations and internal quality reviews, surfacing meaningful performance differences, trade-offs, and failure modes to inform product strategy and prioritization.
- Partner with engineering and product teams to integrate evals into development loops, release gating, and ongoing quality monitoring.
- Mentor and set standards for evaluation practice, reviewing eval designs, guiding other scientists, and shaping the long-term evals roadmap as systems become more agentic and complex.
- End-to-End Project Leadership: Lead the development of new AI-driven projects, encompassing ideation, prototyping, research, infrastructure design, scalability, monitoring, and evaluation.
- Rapid Iteration: Adapt quickly to user feedback and evolving requirements, ensuring continuous improvement in a fast-paced environment.
Requirements
- Strong grounding in applied ML and statistics, with experience evaluating non-deterministic AI systems (LLMs, agents, RAG, search).
- Deep experience with AI evaluation, including metric design, gold dataset creation, head-to-head comparisons, slicing, and error analysis.
- Statistical rigor in model comparison, using methods such as paired tests, bootstrap confidence intervals, and robustness analyses.
- Proficiency in Python for evaluation and analysis, including building eval harnesses, data pipelines, scoring logic, and reproducible analysis workflows.
- Ability to translate vague product or customer goals into measurable evaluation criteria, and to challenge metrics or conclusions that don’t reflect real quality.
- Comfort engaging directly with customers and cross-functional stakeholders, explaining evaluation results, trade-offs, and limitations clearly.
- Strong written and verbal communication, including documenting methodologies and contributing to external publications or talks.
Benefits
- Hubs in San Francisco and New York City offering regular in-person gatherings and co-working sessions
- Flexible PTO with U.S. holidays observed and a week-long shutdown in December to rest and recharge*
- A competitive health insurance plan that covers 100% for the policyholder and 75% for dependents*
- 12 weeks of paid parental leave in the US*
- 401k program, 3% match - vested immediately!*
- $500 work-from-home stipend to be used within a year of your start date*
- $1,200 per year Health & Wellness Allowance to support your personal goals*
- The chance to collaborate with a team at the forefront of AI research
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
applied ML, statistics, AI evaluation, metric design, gold dataset creation, model comparison, Python, data pipelines, scoring logic, error analysis
Soft Skills
communication, customer engagement, cross-functional collaboration, mentoring, project leadership, adaptability, problem-solving, translating goals, continuous improvement, documentation