Hudson IT and Manpower

AI Evaluation Engineer

Full-time

Location Type: Remote

Location: Remote • Alaska • 🇺🇸 United States

Salary

💰 $30 - $50 per hour

Job Level

Mid-Level, Senior

Tech Stack

Azure, Python

About the role

  • Design and execute structured LLM evaluation (Eval) test suites to measure accuracy, relevance, safety, latency, and cost
  • Perform hands-on benchmarking and comparative analysis of Generative AI models
  • Build and maintain automated evaluation pipelines using Python
  • Create and manage datasets, benchmarks, and ground-truth references
  • Conduct structured prompt testing using Azure OpenAI and OpenAI APIs
  • Analyze hallucinations, bias, safety, and security risks in LLM outputs
  • Establish baselines and compare multiple models and prompt strategies
  • Ensure reproducibility and consistency of evaluation results
  • Document evaluation methodologies, metrics, and findings
  • Collaborate with AI/ML engineers, product teams, and stakeholders
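The core of the role — scoring model outputs against ground truth while tracking accuracy and latency — can be sketched as a minimal Python eval harness. This is an illustrative sketch only: the dataset, the `stub_model` function (a stand-in for a real Azure OpenAI or OpenAI API call), and the metric choices are all hypothetical, not part of the posting.

```python
import time

# Hypothetical ground-truth dataset: (prompt, expected answer) pairs.
DATASET = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
    ("What is the largest planet?", "Jupiter"),
]

def stub_model(prompt: str) -> str:
    """Stand-in for a real LLM call; in practice this would hit an API."""
    canned = {
        "What is 2 + 2?": "4",
        "What is the capital of France?": "Paris",
        "What is the largest planet?": "Saturn",  # deliberate wrong answer
    }
    return canned.get(prompt, "")

def run_eval(model, dataset):
    """Score a model against ground truth: exact-match accuracy + mean latency."""
    correct = 0
    latencies = []
    for prompt, expected in dataset:
        start = time.perf_counter()
        output = model(prompt)
        latencies.append(time.perf_counter() - start)
        if output.strip() == expected:
            correct += 1
    return {
        "accuracy": correct / len(dataset),
        "mean_latency_s": sum(latencies) / len(latencies),
    }

results = run_eval(stub_model, DATASET)
print(results)
```

Swapping `stub_model` for different real model backends, and logging the returned metrics per run, is the basic shape of the comparative benchmarking and baseline-tracking the bullets above describe.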

Requirements

  • 3–5 years of experience working with AI/ML or Generative AI technologies
  • Hands-on experience evaluating and benchmarking LLMs for Generative AI
  • Experience building automated LLM evaluation pipelines using Python
  • Experience working with Azure OpenAI or OpenAI APIs
  • Experience using LLM evaluation tools such as OpenAI Evals, HuggingFace Evals, RAGAS, DeepEval, or Promptfoo
  • Experience analyzing hallucinations, bias, safety, and model performance metrics
  • Must hold a Bachelor’s or Master’s degree in Computer Science, Data Science, Artificial Intelligence, or a related field

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
Python, LLM evaluation, benchmarking, Generative AI, automated evaluation pipelines, data analysis, prompt testing, model performance metrics, safety analysis, bias analysis
Soft skills
collaboration, documentation, analytical thinking, problem-solving