AI Engineer, Agents – Evaluation

Guild.ai

An AI Engineer role focused on designing and implementing evaluation frameworks for AI agents, collaborating with cross-functional teams to improve agent quality and performance with state-of-the-art solutions.

Posted 4/12/2026 • Full-time • San Francisco, California, 🇺🇸 United States • Junior / Mid-Level

Tech Stack

Tools & technologies
Python • Spark • TypeScript

About the role

Key responsibilities & impact
  • Create Task Evaluations That Matter: Design and implement task-specific evaluations that measure and improve agent quality. Each eval should both drive concrete iteration on our agents and spark broader innovation around the task itself.
  • Define Tasks, Datasets, and Harnesses: Clearly specify tasks, collect and curate balanced datasets, and build robust evaluation harnesses that can be used across agents and modeling approaches. There is ample room for architectural design and systems thinking here.
  • Build and Use a Reusable Evaluation Framework: Develop frameworks and tools for running evaluations at scale. Use these frameworks to tune existing agents and to guide the development of new ones in our environment.
  • Explore Agent Orchestration Strategies: Investigate and implement orchestration patterns (tooling, routing, decomposition, multi-agent setups, etc.) that allow agents to tackle increasingly complex, multi-step, and long-horizon tasks.
  • Apply Post-Training Techniques: Experiment with post-training approaches (e.g., fine-tuning, preference optimization, reward shaping, distillation) to produce high-performance models tailored to specific tasks and workflows.
  • Run Experiments End-to-End: Design, run, and analyze experiments with rigor. Turn experimental results into clear recommendations and concrete changes to model configurations, prompts, and system design.
  • Collaborate Deeply Across the Stack: Work closely with founders, product, and infrastructure engineers to ensure evaluations, agents, and platform primitives all reinforce each other.

Requirements

What you’ll need
  • MS or Ph.D. in a relevant field (e.g., Computer Science, Machine Learning, NLP) or equivalent practical experience
  • Strong background in machine learning and large language models, ideally including both research and hands-on implementation
  • 2–5 years working with LLM technology, with familiarity across:
      • Prompting and interaction patterns
      • Agent and tool orchestration strategies
      • Evaluation strategies for complex, open-ended tasks
  • Proficiency writing production-quality code, especially in Python; comfort working with TypeScript or modern web/backend stacks
  • Experience designing and running experiments, and interpreting results in messy, real-world settings
  • Self-motivated, comfortable operating in an unstructured, high-ambiguity environment
  • Strong communication skills and the ability to translate vague goals into concrete, testable setups

Benefits

Comp & perks
  • Significant equity in an early-stage, venture-backed startup
  • Comprehensive health benefits (medical, dental, vision)
  • Flexible PTO to ensure you have the time you need to recharge
