Staff Applied Scientist – AI Evaluation, Trust

Sayari

Staff Applied Scientist driving AI Evaluation & Trust for Sayari. Leading development of specialized judge models and rigorous evaluation frameworks.

Posted 4/23/2026 • Full-time • Remote • 🇺🇸 United States • Lead • 💰 $195,000 - $205,000 per year

About the role

Key responsibilities & impact

Lead the development of specialized "judge models," moving from general-purpose frontier models to architectures purpose-built for evaluation and failure mode detection.
Design and execute rigorous scoring pipelines and empirical threshold calibrations for agentic systems, including multi-turn conversation and Graph RAG reasoning.
Establish domain-specific evaluation frameworks that measure whether a system can perform the work of human experts rather than just passing general capability benchmarks.
Own the full lifecycle of evaluation data, from designing annotation infrastructure and protocols to deploying evaluation services into production.
Research and implement advanced techniques in Mixture-of-Experts (MoE) routing, expert specialization evaluation, and ensemble calibration.
Collaborate cross-functionally with Product, Data Engineering, and the SVP of AI to translate complex statistical uncertainty into clear, actionable product signals.
Act as a technical leader and "Scientific Conscience" within the AI pod, ensuring every AI-driven risk signal is backed by an empirical derivation story.

Requirements

What you’ll need

10+ years of Machine Learning experience with a focus on deep neural networks, including evaluating model performance and trust.
1-2+ years of experience focused on post-training activities.
1+ years of experience creating benchmarks to evaluate LLMs.
Technical Mastery: Deep expertise in LLM-as-judge architectures, multi-turn evaluation, and Reinforcement Learning (RL/RLHF/RLAIF).
Statistical Rigor: Mastery of statistics and experimental design, including significance testing, distribution analysis, and inter-rater reliability.
Architectural Depth: Experience with Mixture-of-Experts (MoE) systems, routing behavior, and expert specialization.
Builder Mindset: Proven ability to own the path from data collection to production deployment; we are a small team and every role is "hands-on."
Domain Fluency: Understanding of Graph RAG and the unique challenges of evaluating non-deterministic, agentic workflows.

Benefits

Comp & perks

100% fully paid medical, vision, and dental for employees and their dependents
Generous time off: we observe all US federal holidays and close our office for a winter break (12/24-12/31), in addition to granting 18 PTO days and 10 sick days
Outstanding compensation package; competitive commissions for revenue roles and bonuses for non-revenue positions
A strong commitment to diversity, equity, and inclusion
Eligibility to participate in additional benefits such as 401k match up to 5%, 100% paid life insurance (up to $100,000 coverage), and parental leave
A collaborative and positive culture - your team will be as smart and driven as you
Limitless growth and learning opportunities

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Machine Learning, Deep Neural Networks, Reinforcement Learning, Mixture-of-Experts, Multi-turn evaluation, Statistical analysis, Experimental design, Benchmark creation, Evaluation frameworks, Empirical threshold calibration

Soft Skills

Technical leadership, Collaboration, Communication, Problem-solving, Hands-on approach