
AI Engineer – Product
Mistral AI
full-time
Location Type: Hybrid
Location: Paris • 🇫🇷 France
Job Level
Mid-Level, Senior
Tech Stack
Python, TypeScript
About the role
- Build and maintain an LLM evaluation framework (reference tests, heuristics, model-graded checks).
- Define and track metrics: task success, helpfulness, hallucination proxies, safety flags, latency/cost.
- Run A/B tests for prompts, models, and system prompts; analyze results and recommend rollout or rollback.
- Set up observability for LLM calls: structured logging, tracing, dashboards, alerts.
- Operate the model release process: canary and shadow traffic, sign-offs, SLO-based rollback criteria, regression detection.
- Improve core behaviors: memory write/retrieve policies and evals, intent classification, follow-ups, routing, tool-call reliability.
- Create templates and docs so other teams can author evals and ship safely.
- Partner with Science; diagnose regressions and lead post-mortems.
Requirements
- Strong TypeScript or Python skills.
- Production LLM experience: prompts, tool/function calling, and system prompts.
- Hands-on experience with evals and A/B testing: you can design metrics and make rollout decisions from data.
- Observability: logging, tracing, dashboards, alerting.
- Product mindset: form hypotheses, run experiments, interpret results, iterate.
- Clear written and spoken communication; autonomous and product-oriented.
Ideally, you also have experience with:
- Safety systems: moderation, PII handling/redaction, guardrails.
- Release operations: canary/shadowing, automated rollbacks, experiment platforms.
Benefits
- 💰 Competitive salary and equity
- 🧑‍⚕️ Health insurance
- 🚴 Transportation allowance
- 🥎 Sport allowance
- 🥕 Meal vouchers
- 💰 Private pension plan
- 🍼 Generous parental leave policy
- 🌎 Visa sponsorship
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
TypeScript, Python, LLM evaluation framework, A/B testing, metrics design, observability, logging, tracing, dashboards, safety systems
Soft skills
clear communication, autonomous, product-oriented, hypothesis formation, experiment iteration