About the role
Key responsibilities & impact
- Lead the development of specialized "judge models," moving from general-purpose frontier models to architectures purpose-built for evaluation and failure mode detection.
- Design and execute rigorous scoring pipelines and empirical threshold calibrations for agentic systems, including multi-turn conversation and Graph RAG reasoning.
- Establish domain-specific evaluation frameworks that measure whether a system can perform the work of human experts rather than just passing general capability benchmarks.
- Own the full lifecycle of evaluation data, from designing annotation infrastructure and protocols to deploying evaluation services into production.
- Research and implement advanced techniques in Mixture-of-Experts (MoE) routing, expert specialization evaluation, and ensemble calibration.
- Collaborate cross-functionally with Product, Data Engineering, and the SVP of AI to translate complex statistical uncertainty into clear, actionable product signals.
- Act as a technical leader and "Scientific Conscience" within the AI pod, ensuring every AI-driven risk signal is backed by an empirical derivation story.
Requirements
What you’ll need
- 10+ years of Machine Learning experience with a focus on Deep Neural Networks and on evaluating model performance and trust.
- 1-2+ years of experience focused on post-training activities
- 1+ years of experience creating benchmarks to evaluate LLMs
- Technical Mastery: Deep expertise in LLM-as-judge architectures, multi-turn evaluation, and Reinforcement Learning (RL/RLHF/RLAIF).
- Statistical Rigor: Mastery of statistics and experimental design, including significance testing, distribution analysis, and inter-rater reliability.
- Architectural Depth: Experience with Mixture-of-Experts (MoE) systems, routing behavior, and expert specialization.
- Builder Mindset: Proven ability to own the path from data collection to production deployment; we are a small team and every role is "hands-on."
- Domain Fluency: Understanding of Graph RAG and the unique challenges of evaluating non-deterministic, agentic workflows.
Benefits
Comp & perks
- 100% fully paid medical, vision, and dental for employees and their dependents
- Generous time off: we observe all US federal holidays, close our office for a winter break (12/24-12/31), and grant 18 PTO days and 10 sick days
- Outstanding compensation package; competitive commissions for revenue roles and bonuses for non-revenue positions
- A strong commitment to diversity, equity, and inclusion
- Eligibility to participate in additional benefits such as 401k match up to 5%, 100% paid life insurance (up to $100,000 coverage), and parental leave
- A collaborative and positive culture - your team will be as smart and driven as you
- Limitless growth and learning opportunities
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
- Machine Learning
- Deep Neural Networks
- Reinforcement Learning
- Mixture-of-Experts
- Multi-turn evaluation
- Statistical analysis
- Experimental design
- Benchmark creation
- Evaluation frameworks
- Empirical threshold calibration
Soft Skills
- Technical leadership
- Collaboration
- Communication
- Problem-solving
- Hands-on approach
