Partner with downstream AI application teams to define shared evaluations that codify application expectations of LLMs and other foundation models, ensuring progress can be transparently tracked against real-world needs.
Design rigorous benchmarks and evaluation methodologies across ranking and recommendations, content understanding, and language/text generation, grounded in a deep technical understanding of LLMs, including their strengths, limitations, and failure modes.
Lead the development of evaluators and strong baselines to ensure in-house LLMs and other foundation models demonstrate clear advantages over off-the-shelf alternatives.
Build scalable, reproducible data and evaluation systems that make dataset creation and evaluation design as nimble and experiment-friendly as model development itself.
Hire, grow, and nurture a world-class team, fostering an inclusive, high-performing culture that balances research innovation with engineering excellence.
Work closely with the teams developing Netflix’s foundation models (including our core LLM) to ensure evaluation and data insights are folded back into the cadence of model development.
Proactively influence the ML Platform and Data Engineering teams at key interfaces.
Requirements
8+ years of overall experience, including 3+ years in engineering management.
Experience with large-scale ML systems and foundation models, especially LLMs.
Strong technical expertise in LLMs, their evaluation, and practical methods for ensuring robustness, reproducibility, and quality.
Broad knowledge of machine learning fundamentals and evaluation methodologies, including benchmark design, model-based evaluators, and offline/online metrics.
Experience driving cross-functional projects, including close collaboration with AI application teams to translate product needs into evaluation frameworks.
Excellent written and verbal communication skills, with the ability to bridge technical and non-technical audiences.
Advanced degree in Computer Science, Statistics, or a related quantitative field.
Benefits
Health Plans
Mental Health support
401(k) Retirement Plan with employer match
Stock Option Program
Disability Programs
Health Savings and Flexible Spending Accounts
Family-forming benefits
Life and Serious Injury Benefits
Paid leave of absence programs
Flexible time off for salaried employees