
LLM Evaluation Engineer
ThirdLaw Molecular
Full-time
Location Type: Remote
Location: United States
About the role
- Build the evaluation layer in the ThirdLaw platform for LLM prompts and responses
- Design and tune guardrails, classifiers, and semantic judgment systems that run in real time
- Implement evaluation strategies with semantic similarity, foundation model scoring, and rule-based systems
- Integrate model outputs with downstream enforcement actions (e.g. redaction, escalation, blocking)
- Prototype, tune, and productize small language models for classification, labeling, or scoring
- Collaborate with data infrastructure engineers to connect evaluation logic with ingestion and storage
- Build tools to observe, debug, and improve evaluator performance across data distributions
- Define abstractions for reusable evaluation components that can scale across use cases
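To make the responsibilities above concrete, here is an illustrative sketch (not ThirdLaw's actual implementation) of the kind of evaluation logic the role involves: a semantic-similarity score combined with a rule-based check, mapped to a downstream enforcement action. The function names, thresholds, and the SSN-like regex are hypothetical, and toy vectors stand in for real embeddings.

```python
import math
import re

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def evaluate_response(response_embedding, reference_embedding, response_text,
                      similarity_threshold=0.8,
                      blocked_patterns=(r"\b\d{3}-\d{2}-\d{4}\b",)):
    """Hypothetical evaluator: combine a semantic-similarity score with
    rule-based pattern checks and map the result to an enforcement action
    (pass / redact / escalate), as described in the role's bullet points."""
    score = cosine_similarity(response_embedding, reference_embedding)
    # Rule-based guardrail: redact if the text matches a blocked pattern
    # (here, an SSN-like string chosen purely for illustration).
    for pattern in blocked_patterns:
        if re.search(pattern, response_text):
            return {"score": score, "action": "redact", "reason": "matched blocked pattern"}
    # Semantic check: escalate responses that drift too far from the reference.
    if score < similarity_threshold:
        return {"score": score, "action": "escalate", "reason": "low semantic similarity"}
    return {"score": score, "action": "pass", "reason": "ok"}
```

In production this scoring step would typically sit behind a reusable evaluator interface so the same logic can be composed with classifier scoring or foundation-model judges across use cases.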
Requirements
- 7+ years of experience in ML systems or AI engineering roles
- At least 1–2 years working directly with LLMs, NLP pipelines, or semantic search
- Deep understanding of foundation models (e.g. OpenAI, Claude, Mistral, Llama) and their APIs
- Hands-on experience with vector search (e.g. FAISS, Qdrant, Weaviate) and embeddings pipelines
- Proven ability to implement real-time or near-real-time evaluation logic using semantic similarity, classifier scoring, or structured rules
- Strong Python skills, with familiarity with libraries such as Hugging Face Transformers, LangChain, and PyTorch or TensorFlow
- Ability to reason about model behavior, test prompt configurations, and debug complex decision logic in production
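For candidates gauging the vector-search requirement: production systems would use a library such as FAISS, Qdrant, or Weaviate, but the core idea can be sketched as a brute-force nearest-neighbor lookup over stored embeddings. The class below is a hypothetical toy stand-in, not any of those libraries' APIs.

```python
import math

class InMemoryVectorIndex:
    """Toy stand-in for a vector store (e.g. FAISS, Qdrant, Weaviate):
    brute-force cosine-similarity search over stored embeddings.
    Real indexes use approximate nearest-neighbor structures for scale."""

    def __init__(self):
        self._items = []  # list of (item_id, vector) pairs

    def add(self, item_id, vector):
        self._items.append((item_id, vector))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def search(self, query, k=3):
        """Return the top-k stored items ranked by cosine similarity."""
        scored = [(self._cosine(query, vec), item_id) for item_id, vec in self._items]
        scored.sort(reverse=True)
        return [(item_id, score) for score, item_id in scored[:k]]
```

The same pattern underlies semantic search and embeddings pipelines generally: embed, index, then rank by similarity.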
Benefits
- Generous, well-designed benefits
- Market-rate cash compensation
- Above-market equity
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
machine learning, AI engineering, large language models, natural language processing, semantic search, foundation models, real-time evaluation logic, semantic similarity, classifier scoring, Python
Soft Skills
collaboration, problem-solving, debugging, reasoning, testing