
LLM Evaluation Engineer
ThirdLaw Molecular
Full-time
Location Type: Remote
Location: United States
About the role
- Build the evaluation layer in the ThirdLaw platform for LLM prompts and responses
- Design and tune guardrails, classifiers, and semantic judgment systems that run in real time
- Implement evaluation strategies with semantic similarity, foundation model scoring, and rule-based systems
- Integrate model outputs with downstream enforcement actions (e.g. redaction, escalation, blocking)
- Prototype, tune, and productize small language models for classification, labeling, or scoring
- Collaborate with data infrastructure engineers to connect evaluation logic with ingestion and storage
- Build tools to observe, debug, and improve evaluator performance across data distributions
- Define abstractions for reusable evaluation components that can scale across use cases
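To make the responsibilities above concrete, here is an illustrative sketch (not ThirdLaw's actual implementation) of the kind of evaluation logic the role involves: a semantic-similarity score combined with a rule-based check, mapped to a downstream enforcement action. The function names, thresholds, and the SSN-like regex are hypothetical, and toy vectors stand in for real embeddings.

```python
import math
import re

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def evaluate_response(response_embedding, reference_embedding, response_text,
                      similarity_threshold=0.8,
                      blocked_patterns=(r"\b\d{3}-\d{2}-\d{4}\b",)):
    """Hypothetical evaluator: combine a semantic-similarity score with
    rule-based pattern checks and map the result to an enforcement action
    (pass / redact / escalate), as described in the role's bullet points."""
    score = cosine_similarity(response_embedding, reference_embedding)
    # Rule-based guardrail: redact if the text matches a blocked pattern
    # (here, an SSN-like string chosen purely for illustration).
    for pattern in blocked_patterns:
        if re.search(pattern, response_text):
            return {"score": score, "action": "redact", "reason": "matched blocked pattern"}
    # Semantic check: escalate responses that drift too far from the reference.
    if score < similarity_threshold:
        return {"score": score, "action": "escalate", "reason": "low semantic similarity"}
    return {"score": score, "action": "pass", "reason": "ok"}
```

In production this scoring step would typically sit behind a reusable evaluator interface so the same logic can be composed with classifier scoring or foundation-model judges across use cases.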
Requirements
- 7+ years of experience in ML systems or AI engineering roles
- At least 1–2 years working directly with LLMs, NLP pipelines, or semantic search
- Deep understanding of foundation models (e.g. OpenAI, Claude, Mistral, Llama) and their APIs
- Hands-on experience with vector search (e.g. FAISS, Qdrant, Weaviate) and embeddings pipelines
- Proven ability to implement real-time or near-real-time evaluation logic using semantic similarity, classifier scoring, or structured rules
- Strong Python skills, with familiarity with libraries such as Hugging Face Transformers, LangChain, and PyTorch or TensorFlow
- Ability to reason about model behavior, test prompt configurations, and debug complex decision logic in production
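For candidates gauging the vector-search requirement: production systems would use a library such as FAISS, Qdrant, or Weaviate, but the core idea can be sketched as a brute-force nearest-neighbor lookup over stored embeddings. The class below is a hypothetical toy stand-in, not any of those libraries' APIs.

```python
import math

class InMemoryVectorIndex:
    """Toy stand-in for a vector store (e.g. FAISS, Qdrant, Weaviate):
    brute-force cosine-similarity search over stored embeddings.
    Real indexes use approximate nearest-neighbor structures for scale."""

    def __init__(self):
        self._items = []  # list of (item_id, vector) pairs

    def add(self, item_id, vector):
        self._items.append((item_id, vector))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def search(self, query, k=3):
        """Return the top-k stored items ranked by cosine similarity."""
        scored = [(self._cosine(query, vec), item_id) for item_id, vec in self._items]
        scored.sort(reverse=True)
        return [(item_id, score) for score, item_id in scored[:k]]
```

The same pattern underlies semantic search and embeddings pipelines generally: embed, index, then rank by similarity.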
Benefits
- Generous, well-designed benefits
- Market-rate cash compensation
- Above-market equity
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
machine learning, AI engineering, large language models, natural language processing, semantic search, foundation models, real-time evaluation logic, semantic similarity, classifier scoring, Python
Soft Skills
collaboration, problem-solving, debugging, reasoning, testing