
AI Evaluation Engineer
Hudson IT and Manpower
Full-time
Location Type: Remote
Location: Remote • Alaska • 🇺🇸 United States
Salary
💰 $30–$50 per hour
Job Level
Mid-Level • Senior
Tech Stack
Azure • Python
About the role
- Design and execute structured LLM evaluation (Eval) test suites to measure accuracy, relevance, safety, latency, and cost
- Perform hands-on benchmarking and comparative analysis of Generative AI models
- Build and maintain automated evaluation pipelines using Python (see the sketch after this list)
- Create and manage datasets, benchmarks, and ground-truth references
- Conduct structured prompt testing using Azure OpenAI and OpenAI APIs
- Analyze hallucinations, bias, safety, and security risks in LLM outputs
- Establish baselines and compare multiple models and prompt strategies
- Ensure reproducibility and consistency of evaluation results
- Document evaluation methodologies, metrics, and findings
- Collaborate with AI/ML engineers, product teams, and stakeholders
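
To make the pipeline responsibility concrete, here is a minimal sketch, not taken from the posting, of the kind of automated evaluation loop these bullets describe. It assumes the official `openai` Python client, an `OPENAI_API_KEY` in the environment, and a hypothetical `TEST_SUITE` of prompt/ground-truth pairs; the model name and the containment-based grading rule are placeholders.

```python
# Minimal sketch (illustrative, not from the posting) of an automated
# LLM evaluation loop: run a prompt suite against an OpenAI-compatible
# endpoint and report exact-match accuracy and mean latency.
import time

from openai import OpenAI  # official openai package, v1+ client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical ground-truth dataset: (prompt, expected answer) pairs.
TEST_SUITE = [
    ("What is the capital of Alaska?", "Juneau"),
    ("2 + 2 =", "4"),
]

def evaluate_model(model: str = "gpt-4o-mini") -> dict:
    correct, latencies = 0, []
    for prompt, expected in TEST_SUITE:
        start = time.perf_counter()
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # fixed settings help reproducibility
        )
        latencies.append(time.perf_counter() - start)
        answer = resp.choices[0].message.content or ""
        # Naive containment check stands in for a real grading metric.
        correct += int(expected.lower() in answer.lower())
    return {
        "model": model,
        "accuracy": correct / len(TEST_SUITE),
        "mean_latency_s": sum(latencies) / len(latencies),
    }

if __name__ == "__main__":
    print(evaluate_model())
```

Re-running the same suite with a different `model` argument or prompt template is the baseline-and-comparison workflow the bullets above describe.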
Requirements
- 3–5 years of experience working with AI/ML or Generative AI technologies
- Hands-on experience evaluating and benchmarking LLMs for Generative AI
- Experience building automated LLM evaluation pipelines using Python
- Experience working with Azure OpenAI or OpenAI APIs
- Experience using LLM evaluation tools such as OpenAI Evals, Hugging Face Evaluate, RAGAS, DeepEval, or Promptfoo (a usage sketch follows this list)
- Experience analyzing hallucinations, bias, safety, and model performance metrics
- Bachelor’s or Master’s degree in Computer Science, Data Science, Artificial Intelligence, or a related field
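
For the tooling requirement, a minimal sketch of what using one such tool can look like, following DeepEval's documented quickstart pattern; the test-case contents are invented for illustration, and the metric relies on an LLM judge (e.g. an OpenAI key) to actually run.

```python
# Minimal DeepEval sketch (illustrative, not from the posting):
# score one model output for answer relevancy.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Hypothetical test case: the user's input and the model's actual output.
test_case = LLMTestCase(
    input="What is the capital of Alaska?",
    actual_output="The capital of Alaska is Juneau.",
)

# Judge-based metric; requires an LLM backend (e.g. OPENAI_API_KEY).
metric = AnswerRelevancyMetric(threshold=0.7)

evaluate(test_cases=[test_case], metrics=[metric])
```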
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
Python • LLM evaluation • benchmarking • Generative AI • automated evaluation pipelines • data analysis • prompt testing • model performance metrics • safety analysis • bias analysis
Soft skills
collaboration • documentation • analytical thinking • problem-solving