ThirdLaw Molecular

LLM Evaluation Engineer

Employment Type: Full-time

Location Type: Remote

Location: United States

About the role

  • Build the ThirdLaw platform's evaluation layer for LLM prompts and responses
  • Design and tune guardrails, classifiers, and semantic judgment systems that run in real time
  • Implement evaluation strategies with semantic similarity, foundation model scoring, and rule-based systems
  • Integrate model outputs with downstream enforcement actions (e.g. redaction, escalation, blocking)
  • Prototype, tune, and productize small language models for classification, labeling, or scoring
  • Collaborate with data infrastructure engineers to connect evaluation logic with ingestion and storage
  • Build tools to observe, debug, and improve evaluator performance across data distributions
  • Define abstractions for reusable evaluation components that can scale across use cases

Requirements

  • 7+ years of experience in ML systems or AI engineering roles
  • At least 1–2 years working directly with LLMs, NLP pipelines, or semantic search
  • Deep understanding of foundation models (e.g. OpenAI's GPT models, Claude, Mistral, Llama) and their APIs
  • Hands-on experience with vector search (e.g. FAISS, Qdrant, Weaviate) and embedding pipelines
  • Proven ability to implement real-time or near-real-time evaluation logic using semantic similarity, classifier scoring, or structured rules
  • Strong Python skills, with working knowledge of libraries such as Hugging Face Transformers, LangChain, and PyTorch or TensorFlow
  • Ability to reason about model behavior, test prompt configurations, and debug complex decision logic in production

Benefits

  • Generous, well-designed benefits
  • Market cash compensation
  • Above-market equity