Lattice

Senior Software Engineer, AI

Full-time

Location Type: Remote

Location: Canada

Salary

💰 CA$123,750 - CA$165,000 per year

About the role

  • Design and ship a robust, end-to-end AI evaluation framework, covering offline evals, production tracing, and human-in-the-loop feedback loops, connected across all of Lattice’s AI use cases.
  • Define and instrument the metrics that actually matter: agent task completion, hallucination rates, response quality, user engagement, and downstream business outcomes.
  • Build and maintain evaluation datasets, test harnesses, and automated scoring pipelines to catch regressions before they ship.
  • Identify and surface the drivers of agent quality improvement, giving the team clear signals on where to invest.
  • Architect and implement reusable agent infrastructure: multi-turn conversation workflows, recommendation services, LLM DAGs, and standardized agent topology patterns using LangGraph.
  • Build and scale RAG pipelines and retrieval infrastructure, including vector store management and retrieval quality optimization.
  • Make principled build vs. buy decisions across LLM providers, agent frameworks, and evaluation tooling, balancing capability, cost, latency, and vendor risk.
  • Contribute to production AI systems with a strong focus on reliability, observability, and performance, not just prototypes.
  • Own projects end-to-end: scope them, drive them to completion, and bring in the right people at the right time.
  • Partner with engineering leads and managers to inform technical direction on agent quality and evaluation strategy; you’ll be expected to hold substantive conversations about methodology, not just implementation.
  • Raise the AI engineering bar across the broader team through code review, documentation, and thoughtful technical debate.
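To make the "evaluation datasets, test harnesses, and automated scoring pipelines" bullet concrete, here is a minimal, hypothetical sketch of a regression-gating eval harness in Python. The dataset shape, keyword-overlap scoring rule, and pass threshold are illustrative assumptions, not Lattice's actual framework:

```python
# Hypothetical sketch: score model outputs against an eval dataset and
# gate releases on a regression threshold. All names and the scoring
# rule are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class EvalCase:
    prompt: str
    expected_keywords: list[str]  # facts the answer must mention


def keyword_score(answer: str, case: EvalCase) -> float:
    """Fraction of expected keywords present in the answer."""
    hits = sum(kw.lower() in answer.lower() for kw in case.expected_keywords)
    return hits / len(case.expected_keywords)


def run_eval(model, dataset: list[EvalCase], threshold: float = 0.8) -> bool:
    """Return True if the mean score clears the regression threshold."""
    scores = [keyword_score(model(case.prompt), case) for case in dataset]
    return sum(scores) / len(scores) >= threshold


# Usage with a stubbed "model" standing in for an LLM call:
stub = lambda prompt: "Paris is the capital of France."
dataset = [EvalCase("Capital of France?", ["Paris", "France"])]
print(run_eval(stub, dataset))  # → True
```

A production version would swap keyword overlap for LLM-as-judge or task-completion metrics and wire the pass/fail result into CI, so regressions are caught before they ship.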

Requirements

  • 5+ years of professional software engineering experience with significant time spent on production AI/ML systems.
  • Deep hands-on experience with LLM-based systems: prompt engineering, RAG pipelines, agent orchestration, evaluation metrics, and model fine-tuning.
  • Proven ability to work with data and apply statistics, especially in experiment design and analysis.
  • Proven ability to build and operate agentic AI systems in production: multi-step workflows, multi-agent topologies, and the failure modes that come with them.
  • Strong command of AI evaluation: you’ve built eval frameworks before, you know the difference between a good eval and a vanity metric, and you have opinions about it.
  • Production-grade Python engineering: clean, maintainable, testable code.
  • LangGraph or comparable agent orchestration frameworks. You’ve built real agent workflows with them, not just followed tutorials.
  • LangSmith or comparable LLM observability tooling for tracing, evaluation, and debugging.
  • You read AI papers and blogs regularly and are a trusted source on AI trends.
  • Vector databases (Pinecone or similar) and retrieval system design.
  • AWS ecosystem or other cloud infrastructure (e.g. GCP). Comfortable with Lambda functions, queues, and cloud-native architecture.
  • Familiarity with TypeScript is a plus.
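The "vector databases and retrieval system design" requirement boils down to nearest-neighbor search over embeddings. Here is a hypothetical, pure-Python sketch of that core idea; a real system would use a vector database (e.g. Pinecone) and learned embeddings rather than the toy two-dimensional vectors assumed below:

```python
# Hypothetical sketch: rank documents by cosine similarity to a query
# vector, the core retrieval step behind a RAG pipeline. The in-memory
# "store" and toy vectors are illustrative assumptions.
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec: list[float], store, k: int = 2) -> list[str]:
    """store: list of (doc_id, vector) pairs. Return top-k doc ids."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


store = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.9, 0.1])]
print(retrieve([1.0, 0.0], store, k=2))  # → ['a', 'c']
```

Retrieval-quality optimization, as mentioned in the role, then becomes tuning what goes into the store (chunking, embedding model) and how results are ranked and filtered.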

Benefits

  • Medical insurance
  • Dental insurance
  • Life, AD&D, and Disability Insurance
  • Natural Disaster Support Program
  • Wellness Apps
  • Paid Parental Leave
  • Paid Time off inclusive of holidays and sick time
  • Working Remotely Stipend
  • One time WFH Office Set-Up Stipend
  • Retirement Plan
  • Financial Planning
  • Learning & Development Budget

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
AI evaluation framework, production AI/ML systems, LLM-based systems, prompt engineering, RAG pipelines, agent orchestration, evaluation metrics, model fine-tuning, production-grade Python engineering, multi-step workflows
Soft Skills
project ownership, technical direction, collaboration, communication, critical thinking, problem-solving, code review, documentation, technical debate, data-driven decision making