Design and implement comprehensive scorecards and benchmarking suites for LLM-based extraction, summarization, and chat interfaces
Act as technical lead when working with Subject Matter Experts (SMEs) to codify their expertise into evaluation datasets and "ground truth" labels
Design the statistical guardrails to scale both our human and automated labeling efforts
Provide clear, data-driven "Go/No-Go" recommendations for model deployment

Requirements

5+ years of experience in Data Science with a strong background in traditional statistics
2+ years of focused experience working with LLMs, specifically in evaluation, benchmarking, and prompt auditing
Master’s or PhD in Statistics, Mathematics, or a related quantitative field
Proficient in Python (Pandas, Scikit-learn, Statsmodels) and SQL
Familiarity with LLM evaluation frameworks is a major plus
Proven ability to work with non-technical SMEs to translate their qualitative feedback into quantitative metrics

Benefits
