
AI Evaluation Engineer
Hudson IT and Manpower
Full-time
Location Type: Remote
Location: Remote • Alaska • 🇺🇸 United States
Salary
💰 $30–$50 per hour
Job Level
Mid-Level • Senior
Tech Stack
Azure • Python
About the role
- Design and execute structured LLM evaluation (Eval) test suites to measure accuracy, relevance, safety, latency, and cost
- Perform hands-on benchmarking and comparative analysis of Generative AI models
- Build and maintain automated evaluation pipelines using Python (see the sketch after this list)
- Create and manage datasets, benchmarks, and ground-truth references
- Conduct structured prompt testing using Azure OpenAI and OpenAI APIs
- Analyze hallucinations, bias, safety, and security risks in LLM outputs
- Establish baselines and compare multiple models and prompt strategies
- Ensure reproducibility and consistency of evaluation results
- Document evaluation methodologies, metrics, and findings
- Collaborate with AI/ML engineers, product teams, and stakeholders
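
To make the pipeline responsibility concrete, here is a minimal sketch, not taken from the posting, of the kind of automated evaluation loop these bullets describe. It assumes the official `openai` Python client, an `OPENAI_API_KEY` in the environment, and a hypothetical `TEST_SUITE` of prompt/ground-truth pairs; the model name and the containment-based grading rule are placeholders.

```python
# Minimal sketch (illustrative, not from the posting) of an automated
# LLM evaluation loop: run a prompt suite against an OpenAI-compatible
# endpoint and report exact-match accuracy and mean latency.
import time

from openai import OpenAI  # official openai package, v1+ client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical ground-truth dataset: (prompt, expected answer) pairs.
TEST_SUITE = [
    ("What is the capital of Alaska?", "Juneau"),
    ("2 + 2 =", "4"),
]

def evaluate_model(model: str = "gpt-4o-mini") -> dict:
    correct, latencies = 0, []
    for prompt, expected in TEST_SUITE:
        start = time.perf_counter()
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # fixed settings help reproducibility
        )
        latencies.append(time.perf_counter() - start)
        answer = resp.choices[0].message.content or ""
        # Naive containment check stands in for a real grading metric.
        correct += int(expected.lower() in answer.lower())
    return {
        "model": model,
        "accuracy": correct / len(TEST_SUITE),
        "mean_latency_s": sum(latencies) / len(latencies),
    }

if __name__ == "__main__":
    print(evaluate_model())
```

Re-running the same suite with a different `model` argument or prompt template is the baseline-and-comparison workflow the bullets above describe.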
Requirements
- 3–5 years of experience working with AI/ML or Generative AI technologies
- Hands-on experience evaluating and benchmarking LLMs for Generative AI
- Experience building automated LLM evaluation pipelines using Python
- Experience working with Azure OpenAI or OpenAI APIs
- Experience using LLM evaluation tools such as OpenAI Evals, Hugging Face Evaluate, RAGAS, DeepEval, or Promptfoo (a usage sketch follows this list)
- Experience analyzing hallucinations, bias, safety, and model performance metrics
- Bachelor’s or Master’s degree in Computer Science, Data Science, Artificial Intelligence, or a related field
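
For the tooling requirement, a minimal sketch of what using one such tool can look like, following DeepEval's documented quickstart pattern; the test-case contents are invented for illustration, and the metric relies on an LLM judge (e.g. an OpenAI key) to actually run.

```python
# Minimal DeepEval sketch (illustrative, not from the posting):
# score one model output for answer relevancy.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Hypothetical test case: the user's input and the model's actual output.
test_case = LLMTestCase(
    input="What is the capital of Alaska?",
    actual_output="The capital of Alaska is Juneau.",
)

# Judge-based metric; requires an LLM backend (e.g. OPENAI_API_KEY).
metric = AnswerRelevancyMetric(threshold=0.7)

evaluate(test_cases=[test_case], metrics=[metric])
```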
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
Python • LLM evaluation • benchmarking • Generative AI • automated evaluation pipelines • data analysis • prompt testing • model performance metrics • safety analysis • bias analysis
Soft skills
collaboration • documentation • analytical thinking • problem-solving