Salary
💰 $189,600 - $312,730 per year
About the role
- Architect and lead development of large-scale evaluation platforms for LLMs and agents, enabling automated, reproducible, and extensible assessment
- Define organizational standards and metrics for LLM/agent evaluation covering hallucination detection, factuality, bias, robustness, interpretability, and alignment drift
- Build platform components and APIs that let product teams integrate evaluation into training, fine-tuning, deployment, and continuous monitoring workflows
- Design automated pipelines and benchmarks for adversarial testing, red-teaming, and stress testing of LLMs and RAG systems
- Lead initiatives in multi-dimensional evaluation including safety, grounding, and agent behaviors
- Collaborate with cross-functional stakeholders to translate evaluation goals into measurable system-level frameworks
- Advance interpretability and observability tooling to understand, debug, and explain LLM behaviors in production
- Mentor engineers and establish best practices that drive adoption of evaluation-driven development
- Influence technical roadmaps and represent the team’s evaluation-first approach in external forums and publications
Requirements
- 10+ years of ML engineering experience
- 3+ years focused on large-scale evaluation of transformer-based LLMs and/or agentic systems
- Proven experience building evaluation platforms or frameworks that operate across training, deployment, and post-deployment contexts
- Deep expertise in designing and implementing LLM evaluation metrics (factuality, hallucination detection, grounding, toxicity, robustness)
- Strong background in scalable platform engineering, including APIs, pipelines, and integrations used by multiple product teams
- Demonstrated ability to bridge research and engineering, operationalizing safety and alignment techniques into production evaluation systems
- Proficiency in Python, PyTorch, Hugging Face, and modern MLOps and deployment environments
- Track record of technical leadership, including mentoring, architecture design, and defining org-wide practices
- Experience with multi-agent evaluation frameworks and graph-based metrics for analyzing agent interactions (preferred)
- Background in retrieval-augmented generation (RAG) evaluation (retrieval precision/recall, grounding, attribution) (preferred)
- Contributions to AI safety or evaluation research in industry or academia (preferred)
- Familiarity with adversarial testing methodologies and automated red-teaming (preferred)
- Knowledge of interpretability and transparency methods for LLMs (preferred)
- Advanced degree in ML/CS or related field with focus on evaluation, safety, or interpretability (preferred)