Distyl AI

AI Evaluation Engineer

Employment Type: Full-time

Location Type: Hybrid

Location: San Francisco, California or New York, United States

Salary

$130,000 - $250,000 per year

Tech Stack

Python

About the role

  • Design and implement evaluation frameworks that enable Evaluation-Driven Development for AI systems deployed in customer environments
  • Define how system quality is measured in each domain, ensuring that evaluation signals reflect real user needs, domain constraints, and business objectives
  • Build and maintain golden test cases and regression suites in Python, using both human-authored and AI-assisted test generation to capture critical behaviors and edge cases. These test suites are treated as first-class system components that evolve alongside the AI system itself (a minimal pytest-style sketch follows this list)
  • Develop and maintain evaluation pipelines, both offline and online, that integrate directly into system iteration loops. Evaluation results inform prompt design, agent logic, model selection, and release readiness, ensuring that system changes are driven by measurable improvements rather than intuition alone
  • Define, calibrate, and operate LLM-based graders, aligning automated judgments with expert human assessments. Investigate where evaluation signals diverge from real-world outcomes and refine grading approaches to maintain signal quality as systems and domains evolve (see the calibration sketch after this list)
  • Work closely with Forward Deployed AI Engineers, Architects, Product Engineers, AI Strategists, and domain experts to ensure evaluation frameworks meaningfully guide system development and deployment in production
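
To ground the golden-case work above, here is a minimal pytest-style sketch of what a version-controlled regression suite can look like. The GoldenCase schema, the golden_cases.json path, and the run_agent hook are illustrative assumptions, not Distyl's actual framework:

```python
import json
from dataclasses import dataclass
from pathlib import Path

import pytest


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    prompt: str                  # input sent to the system under test
    must_contain: list[str]      # behaviors the response must exhibit
    must_not_contain: list[str]  # known failure modes to guard against


def load_golden_cases(path: Path) -> list[GoldenCase]:
    # Golden cases live in version control and evolve with the system itself.
    return [GoldenCase(**record) for record in json.loads(path.read_text())]


def run_agent(prompt: str) -> str:
    # Stand-in for the deployed AI system under evaluation.
    raise NotImplementedError


@pytest.mark.parametrize(
    "case",
    load_golden_cases(Path("golden_cases.json")),
    ids=lambda case: case.case_id,
)
def test_golden_case(case: GoldenCase) -> None:
    response = run_agent(case.prompt).lower()
    for required in case.must_contain:
        assert required.lower() in response, f"missing behavior: {required}"
    for forbidden in case.must_not_contain:
        assert forbidden.lower() not in response, f"regression: {forbidden}"
```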
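
Similarly, calibrating an LLM-based grader comes down to measuring how often its verdicts agree with expert labels. A hedged sketch, assuming a hypothetical llm_grade wrapper and an illustrative 90% agreement target:

```python
from collections.abc import Callable

# Hypothetical grader signature: takes (prompt, response), returns pass/fail.
Grader = Callable[[str, str], bool]


def agreement_rate(
    llm_grade: Grader,
    labeled: list[tuple[str, str, bool]],  # (prompt, response, expert_verdict)
) -> float:
    # Fraction of labeled cases where the automated grader matches the expert.
    matches = sum(
        llm_grade(prompt, response) == expert_verdict
        for prompt, response, expert_verdict in labeled
    )
    return matches / len(labeled)


MIN_AGREEMENT = 0.90  # illustrative threshold; tuned per domain in practice


def check_calibration(llm_grade: Grader, labeled: list[tuple[str, str, bool]]) -> None:
    rate = agreement_rate(llm_grade, labeled)
    if rate < MIN_AGREEMENT:
        # Grader and experts diverge: inspect the disagreeing cases and refine
        # the grading prompt or rubric before trusting automated judgments.
        raise RuntimeError(f"grader/human agreement {rate:.0%} is below target")
```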

Requirements

  • 2+ years of software engineering experience
  • Strong Python Engineering Skills: You write clean, maintainable Python and are comfortable building evaluation and experimentation pipelines that run in production environments
  • Experience with Evaluation-Driven or Experiment-Driven Development: You have used structured evaluation or experimentation frameworks to drive system iteration, and you understand the pitfalls of overfitting to metrics that don’t reflect real outcomes
  • Ability to Translate Human Judgment into Code: You work with subject matter experts to elicit high-quality judgments and encode them into test cases, scoring functions, and graders that scale (see the rubric sketch after this list)
  • Systems-Oriented Mindset: You understand how evaluation interacts with prompts, agents, data, and deployment, and you design evaluation systems that support fast iteration while maintaining trust and safety in production
  • AI-Native Working Style: You use AI tools to generate tests, analyze failures, explore edge cases, and accelerate debugging and iteration
  • Travel: Ability to travel 25-50%
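
As one concrete illustration of translating human judgment into code, the sketch below encodes an expert rubric as weighted programmatic checks. Every criterion, weight, and threshold here is hypothetical; in practice each would be elicited from and reviewed with domain experts:

```python
from collections.abc import Callable

# A rubric criterion: (description, weight, programmatic check).
Criterion = tuple[str, float, Callable[[str], bool]]

# All criteria, weights, and thresholds below are hypothetical placeholders.
RUBRIC: list[Criterion] = [
    ("cites a source", 0.4, lambda r: "source:" in r.lower()),
    ("hedges uncertain claims", 0.3,
     lambda r: any(w in r.lower() for w in ("may", "might", "likely"))),
    ("respects length cap", 0.3, lambda r: len(r.split()) <= 200),
]


def rubric_score(response: str) -> float:
    # Weighted score in [0, 1]: a response earns a criterion's weight
    # whenever that criterion's check passes.
    total = sum(weight for _, weight, _ in RUBRIC)
    earned = sum(weight for _, weight, check in RUBRIC if check(response))
    return earned / total
```

A function like rubric_score can then back both offline regression runs and the scoring half of an LLM-grader prompt.
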
Benefits
  • 100% covered medical, dental, and vision for employees and dependents
  • 401(k) with additional perks (e.g., commuter benefits, in-office lunch)
  • Access to state-of-the-art models, generous usage of modern AI tools, and real-world business problems
  • Ownership of high-impact projects across top enterprises
  • A mission-driven, fast-moving culture that prizes curiosity, pragmatism, and excellence

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Python, Evaluation-Driven Development, Experiment-Driven Development, Test case development, Regression testing, Evaluation frameworks, Grading systems, AI-assisted test generation, Evaluation pipelines, Systems design
Soft Skills
Collaboration, Communication, Problem-solving, Adaptability, Critical thinking, Attention to detail, Systems-oriented mindset, Ability to translate human judgment into code