Design and own comprehensive evaluations that measure accuracy, completeness, style, hallucination rate, bias, and safety across every release.
Tune and iterate on RAG pipelines, prompt chains, conversation loops, provider selections, and fine-tunes until quality bars are met or exceeded.
Build reusable data and evaluation pipelines, a shared semantic layer, and monitoring dashboards that make it easy for product teams to ship reliable AI quickly.
Optimize for cost and latency, continuously benchmarking models and negotiating trade-offs between performance and spend.
Implement robust data governance and lineage practices that satisfy enterprise compliance requirements and support our AI bias audit process.
Document best practices and share knowledge to raise the bar for AI development across BrightHire.
Requirements
5+ years in Data Science or ML engineering with a strong focus on ML or NLP systems.
1+ year focused on Gen-AI or LLM systems.
Strong Python and SQL skills.
Experience creating automated evaluation suites for LLM outputs (accuracy, safety, bias, tone, style) and using results to guide iterative improvements.
Knowledge of prompt engineering, RAG techniques, vector search, embeddings, fine-tuning, and model selection across multiple providers.
Ability to communicate complex AI trade-offs clearly to engineers, designers, and executives alike.
Bias toward action, curiosity, and a passion for building high-quality user experiences.
Benefits
15 days PTO
12 national holidays
Healthcare stipend
Work-from-home, learning, and vacation stipends
Company-provided computer