Sayari

Staff Applied Scientist – AI Evaluation, Trust

Sayari is hiring a Staff Applied Scientist to drive AI Evaluation & Trust, leading the development of specialized judge models and rigorous evaluation frameworks.

Posted 4/23/2026 • Full-time • Remote • 🇺🇸 United States • Lead • 💰 $195,000 - $205,000 per year

About the role

Key responsibilities & impact
  • Lead the development of specialized "judge models," moving from general-purpose frontier models to architectures purpose-built for evaluation and failure mode detection.
  • Design and execute rigorous scoring pipelines and empirical threshold calibrations for agentic systems, including multi-turn conversation and Graph RAG reasoning.
  • Establish domain-specific evaluation frameworks that measure whether a system can perform the work of human experts rather than just passing general capability benchmarks.
  • Own the full lifecycle of evaluation data, from designing annotation infrastructure and protocols to deploying evaluation services into production.
  • Research and implement advanced techniques in Mixture-of-Experts (MoE) routing, expert specialization evaluation, and ensemble calibration.
  • Collaborate cross-functionally with Product, Data Engineering, and the SVP of AI to translate complex statistical uncertainty into clear, actionable product signals.
  • Act as a technical leader and "Scientific Conscience" within the AI pod, ensuring every AI-driven risk signal is backed by an empirical derivation story.

Requirements

What you’ll need
  • 10+ years of Machine Learning experience focused on deep neural networks, including evaluating model performance and trust.
  • 1-2+ years of experience focused on post-training activities.
  • 1+ years of experience creating benchmarks to evaluate LLMs.
  • Technical Mastery: Deep expertise in LLM-as-judge architectures, multi-turn evaluation, and Reinforcement Learning (RL/RLHF/RLAIF).
  • Statistical Rigor: Mastery of statistics and experimental design, including significance testing, distribution analysis, and inter-rater reliability.
  • Architectural Depth: Experience with Mixture-of-Experts (MoE) systems, routing behavior, and expert specialization.
  • Builder Mindset: Proven ability to own the path from data collection to production deployment; we are a small team and every role is "hands-on."
  • Domain Fluency: Understanding of Graph RAG and the unique challenges of evaluating non-deterministic, agentic workflows.

Benefits

Comp & perks
  • 100% fully paid medical, vision, and dental for employees and their dependents
  • Generous time off: we observe all US federal holidays, close our office for a winter break (12/24-12/31), and grant 18 PTO days and 10 sick days
  • Outstanding compensation package; competitive commissions for revenue roles and bonuses for non-revenue positions
  • A strong commitment to diversity, equity, and inclusion
  • Eligibility to participate in additional benefits such as 401k match up to 5%, 100% paid life insurance (up to $100,000 coverage), and parental leave
  • A collaborative and positive culture - your team will be as smart and driven as you
  • Limitless growth and learning opportunities

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
  • Machine Learning
  • Deep Neural Networks
  • Reinforcement Learning
  • Mixture-of-Experts
  • Multi-turn evaluation
  • Statistical analysis
  • Experimental design
  • Benchmark creation
  • Evaluation frameworks
  • Empirical threshold calibration
Soft Skills
  • Technical leadership
  • Collaboration
  • Communication
  • Problem-solving
  • Hands-on approach