Salary
💰 $189,600 - $312,730 per year
About the role
- Architect and lead development of large-scale evaluation platforms for LLMs and agents, enabling automated, reproducible, and extensible assessment
- Define organizational standards and metrics for LLM/agent evaluation covering hallucination detection, factuality, bias, robustness, interpretability, and alignment drift
- Build platform components and APIs that let product teams integrate evaluation into training, fine-tuning, deployment, and continuous monitoring workflows
- Design automated pipelines and benchmarks for adversarial testing, red-teaming, and stress testing of LLMs and RAG systems
- Lead initiatives in multi-dimensional evaluation including safety, grounding, and agent behaviors
- Collaborate with cross-functional stakeholders to translate evaluation goals into measurable system-level frameworks
- Advance interpretability and observability tooling to understand, debug, and explain LLM behaviors in production
- Mentor engineers and establish best practices that drive adoption of evaluation-driven development
- Influence technical roadmaps and represent the team’s evaluation-first approach in external forums and publications
Requirements
- 10+ years of ML engineering experience
- 3+ years focused on large-scale evaluation of transformer-based LLMs and/or agentic systems
- Proven experience building evaluation platforms or frameworks that operate across training, deployment, and post-deployment contexts
- Deep expertise in designing and implementing LLM evaluation metrics (factuality, hallucination detection, grounding, toxicity, robustness)
- Strong background in scalable platform engineering, including APIs, pipelines, and integrations used by multiple product teams
- Demonstrated ability to bridge research and engineering, operationalizing safety and alignment techniques into production evaluation systems
- Proficiency in Python, PyTorch, Hugging Face, and modern MLOps and deployment environments
- Track record of technical leadership, including mentoring, architecture design, and defining org-wide practices
- Experience with multi-agent evaluation frameworks and graph-based metrics for analyzing agent interactions (preferred)
- Background in retrieval-augmented generation (RAG) evaluation (retrieval precision/recall, grounding, attribution) (preferred)
- Contributions to AI safety or evaluation research in industry or academia (preferred)
- Familiarity with adversarial testing methodologies and automated red-teaming (preferred)
- Knowledge of interpretability and transparency methods for LLMs (preferred)
- Advanced degree in ML/CS or related field with focus on evaluation, safety, or interpretability (preferred)