
AI Engineer, Quality – Evals
Fieldguide is hiring an engineer to design and build a unified evaluation platform that serves as the single source of truth for all of its agentic systems and audit workflows.
Posted 4/21/2026 · Full-time · Remote • California • 🇺🇸 United States · Junior · 💰 $170,000–$220,000 per year
Tech Stack
Tools & technologies: Postgres, Python, React, TypeScript
About the role
Key responsibilities & impact
- Design and build a unified evaluation platform that serves as the single source of truth for all of our agentic systems and audit workflows
- Build observability systems that surface agent behavior, trace execution, and failure modes in production, and feedback loops that turn production failures into first-class evaluation cases
- Own the evaluation infrastructure stack, including integration with LangSmith and LangGraph
- Translate customer problems into concrete agent behaviors and workflows
- Integrate and orchestrate LLMs, tools, retrieval systems, and logic into cohesive, reliable agent experiences
- Build automated pipelines that evaluate new models against all critical workflows within hours of release
- Design evaluation harnesses for our most complex agentic systems and workflows
- Implement comparison frameworks that measure effectiveness, consistency, latency, and cost across model versions
- Design guardrails and monitoring systems that catch quality regressions before they reach customers
- Use AI as core leverage in how you design, build, test, and iterate
- Prototype quickly to resolve uncertainty, then harden systems for enterprise-grade reliability
- Build evaluations, feedback mechanisms, and guardrails so agents improve over time
- Work with SMEs and ML engineers to create evaluation datasets by curating production traces
- Design prompts, retrieval pipelines, and agent orchestration systems that perform reliably at scale
- Define and document evaluation standards, best practices, and processes for the engineering organization
- Advocate for evaluation-driven development and make it easy for the team to write and run evals
- Partner with product and ML engineers to integrate evaluation requirements into agent development from day one
- Take full ownership of large product areas rather than executing on narrow tasks
Requirements
What you’ll need
- Multiple years of experience shipping production software in complex, real-world systems
- Experience with TypeScript, React, Python, and Postgres
- Built and deployed LLM-powered features serving production traffic
- Implemented evaluation frameworks for model outputs and agent behaviors
- Designed observability or tracing infrastructure for AI/ML systems
- Worked with vector databases, embedding models, and RAG architectures
- Experience with evaluation platforms (LangSmith, Langfuse, or similar)
- Comfort operating in ambiguity and taking responsibility for outcomes
- Deep empathy for professional-grade, mission-critical software (experience with audit and accounting workflows is not required)
Benefits
Comp & perks
- Competitive compensation packages with meaningful ownership
- Flexible PTO
- 401k
- Wellness benefits, including a bundle of free therapy sessions
- Technology & Work from Home reimbursement
- Flexible work schedules
ATS Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
TypeScript, React, Python, Postgres, LLM-powered features, evaluation frameworks, observability infrastructure, vector databases, embedding models, RAG architectures
Soft Skills
Operating in ambiguity, responsibility for outcomes, deep empathy