Collaborate with teams to improve model reasoning, reliability, and production readiness
Requirements
Live and breathe model evaluation, LLM safety, prompt robustness, data quality assurance, multilingual and domain-specific testing, grounding verification, and compliance/readiness checks
Experience with hallucination detection, factual consistency, prompt-injection and jailbreak resistance, bias/fairness audits, chain-of-reasoning reliability, tool-use correctness, retrieval-augmentation fidelity, and end-to-end workflow validation
Design and run test plans and regression suites, build clear rubrics and pass/fail criteria, capture reproducible error traces with root-cause hypotheses
Suggest improvements to prompt engineering, guardrails, and evaluation metrics (e.g., precision/recall, faithfulness, toxicity, and latency SLOs)
Partner on adversarial red-teaming, automation (Python/SQL), and dashboarding to track quality deltas over time
Shipped QA for ML/AI systems and safety/red-team experience
Test automation frameworks (e.g., PyTest)
Hands-on work with LLM eval tooling (e.g., OpenAI Evals, RAG evaluators, W&B)
Skills: evaluation rubric design, adversarial testing/red-teaming, regression testing at scale, bias/fairness auditing, grounding verification, prompt and system-prompt engineering, test automation (Python/SQL), and high-signal bug reporting
Clear, metacognitive communication—"showing your work"—is essential
Contractor must supply a secure computer and high-speed internet