Invisible Technologies

AI QA Trainer, LLM Evaluation

Contract

Origin: 🌎 Anywhere in the World

Salary

💰 $6 - $65 per hour

Job Level

Mid-Level, Senior

Tech Stack

Python, SQL

About the role

  • Converse with LLMs on real-world scenarios and evaluation prompts to verify factual accuracy and logical soundness
  • Design and run test plans, regression suites, and evaluation rubrics with pass/fail criteria
  • Document failures, capture reproducible error traces, and provide root-cause hypotheses
  • Suggest improvements to prompt engineering, guardrails, and evaluation metrics (precision/recall, faithfulness, toxicity, latency SLOs)
  • Partner on adversarial red-teaming, automation (Python/SQL), and dashboarding to track quality deltas over time
  • Challenge models on hallucination detection, factual consistency, prompt-injection/jailbreak resistance, bias/fairness audits, chain-of-reasoning reliability, tool-use correctness, retrieval-augmentation fidelity, and end-to-end workflow validation
  • Collaborate with teams to improve model reasoning, reliability, and production readiness

Requirements

  • Live and breathe model evaluation, LLM safety, prompt robustness, data quality assurance, multilingual and domain-specific testing, grounding verification, and compliance/readiness checks
  • Experience with hallucination detection, factual consistency, prompt-injection and jailbreak resistance, bias/fairness audits, chain-of-reasoning reliability, tool-use correctness, retrieval-augmentation fidelity, and end-to-end workflow validation
  • Design and run test plans and regression suites, build clear rubrics and pass/fail criteria, capture reproducible error traces with root-cause hypotheses
  • Suggest improvements to prompt engineering, guardrails, and evaluation metrics (e.g., precision/recall, faithfulness, toxicity, and latency SLOs)
  • Partner on adversarial red-teaming, automation (Python/SQL), and dashboarding to track quality deltas over time
  • Experience shipping QA for ML/AI systems, plus safety/red-team experience
  • Familiarity with test automation frameworks (e.g., PyTest)
  • Hands-on work with LLM eval tooling (e.g., OpenAI Evals, RAG evaluators, W&B)
  • Skills: evaluation rubric design, adversarial testing/red-teaming, regression testing at scale, bias/fairness auditing, grounding verification, prompt and system-prompt engineering, test automation (Python/SQL), and high-signal bug reporting
  • Clear, metacognitive communication ("showing your work") is essential
  • Contractor must supply a secure computer and high-speed internet