AI QA Trainer, LLM Evaluation

Invisible Technologies

contract

Posted on: 9/18/2025

Origin: • 🌎 Anywhere in the World

✨ AI Apply

💰 $6 - $65 per hour

Mid-LevelSenior

PythonSQL

About the role

Converse with LLMs on real-world scenarios and evaluation prompts to verify factual accuracy and logical soundness
Design and run test plans, regression suites, and evaluation rubrics with pass/fail criteria
Document failures, capture reproducible error traces, and provide root-cause hypotheses
Suggest improvements to prompt engineering, guardrails, and evaluation metrics (precision/recall, faithfulness, toxicity, latency SLOs)
Partner on adversarial red-teaming, automation (Python/SQL), and dashboarding to track quality deltas over time
Challenge models on hallucination detection, factual consistency, prompt-injection/jailbreak resistance, bias/fairness audits, chain-of-reasoning reliability, tool-use correctness, retrieval-augmentation fidelity, and end-to-end workflow validation
Collaborate with teams to improve model reasoning, reliability, and production readiness

Live and breathe model evaluation, LLM safety, prompt robustness, data quality assurance, multilingual and domain-specific testing, grounding verification, and compliance/readiness checks
Experience with hallucination detection, factual consistency, prompt-injection and jailbreak resistance, bias/fairness audits, chain-of-reasoning reliability, tool-use correctness, retrieval-augmentation fidelity, and end-to-end workflow validation
Design and run test plans and regression suites, build clear rubrics and pass/fail criteria, capture reproducible error traces with root-cause hypotheses
Suggest improvements to prompt engineering, guardrails, and evaluation metrics (e.g., precision/recall, faithfulness, toxicity, and latency SLOs)
Partner on adversarial red-teaming, automation (Python/SQL), and dashboarding to track quality deltas over time
Shipped QA for ML/AI systems and safety/red-team experience
Test automation frameworks (e.g., PyTest)
Hands-on work with LLM eval tooling (e.g., OpenAI Evals, RAG evaluators, W&B)
Skills: evaluation rubric design, adversarial testing/red-teaming, regression testing at scale, bias/fairness auditing, grounding verification, prompt and system-prompt engineering, test automation (Python/SQL), and high-signal bug reporting
Clear, metacognitive communication—"showing your work"—is essential
Contractor must supply a secure computer and high-speed internet