
AI QA Trainer – LLM Evaluation
Invisible Technologies
Contract
Location Type: Remote
Location: Anywhere in the World
Salary
💰 $6 - $65 per hour
About the role
- Converse with the model on real-world scenarios and evaluation prompts
- Verify factual accuracy and logical soundness
- Design and run test plans and regression suites
- Build clear rubrics and pass/fail criteria
- Capture reproducible error traces with root-cause hypotheses
- Suggest improvements to prompt engineering, guardrails, and evaluation metrics (e.g., precision/recall, faithfulness, toxicity, and latency SLOs)
- Partner on adversarial red-teaming, automation (Python/SQL), and dashboarding to track quality deltas over time
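The rubric and regression-suite responsibilities above can be sketched in a few lines of Python. This is a minimal illustration, not Invisible's actual tooling: `get_model_answer`, the rubric fields, and the sample case are all hypothetical stand-ins.

```python
# Minimal sketch of a rubric-based regression check for LLM outputs.
# get_model_answer() is a hypothetical stand-in for the model under test.

def get_model_answer(prompt: str) -> str:
    """Hypothetical stub; in practice this would call the model under test."""
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
    }
    return canned.get(prompt, "")

def grade(answer: str, must_include: list[str], must_exclude: list[str]) -> dict:
    """Apply a simple pass/fail rubric and return a reproducible trace."""
    missing = [t for t in must_include if t.lower() not in answer.lower()]
    forbidden = [t for t in must_exclude if t.lower() in answer.lower()]
    return {
        "passed": not missing and not forbidden,
        "missing": missing,
        "forbidden": forbidden,
        "answer": answer,  # captured so failures are reproducible
    }

# Illustrative regression suite: each case pairs a prompt with pass/fail criteria.
REGRESSION_SUITE = [
    {
        "prompt": "What is the capital of France?",
        "must_include": ["Paris"],
        "must_exclude": ["Lyon"],  # guard against a plausible wrong answer
    },
]

def run_suite() -> list[dict]:
    results = []
    for case in REGRESSION_SUITE:
        answer = get_model_answer(case["prompt"])
        results.append({"prompt": case["prompt"],
                        **grade(answer, case["must_include"], case["must_exclude"])})
    return results

if __name__ == "__main__":
    for r in run_suite():
        print(("PASS" if r["passed"] else "FAIL") + ": " + r["prompt"])
```

In a real workflow each `grade` trace would feed a bug report or dashboard, and the suite would be rerun after every prompt or guardrail change to track quality deltas.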
Requirements
- A bachelor’s, master’s, or PhD in computer science, data science, computational linguistics, statistics, or a related field is ideal
- Shipped QA for ML/AI systems
- Safety/red-team experience
- Test automation frameworks (e.g., PyTest)
- Hands-on work with LLM eval tooling (e.g., OpenAI Evals, RAG evaluators, W&B)
- Skills that stand out include: evaluation rubric design, adversarial testing/red-teaming, regression testing at scale, bias/fairness auditing, grounding verification, prompt and system-prompt engineering, test automation (Python/SQL), and high-signal bug reporting
- Clear, metacognitive communication ("showing your work") is essential
Benefits
- This contract role does not include company-sponsored benefits such as health insurance
- You'll need to supply your own secure computer and a high-speed internet connection
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Python, SQL, test automation, evaluation rubric design, adversarial testing, regression testing, bias auditing, grounding verification, prompt engineering, high-signal bug reporting
Soft Skills
metacognitive communication, collaboration, critical thinking, problem-solving