Research Engineer, Evals

White Circle

Research Engineer at AI safety company White Circle, managing benchmarks for agent behaviors and content safety. Collaborate with the product team on evaluating core functionality, and adapt benchmarks for product features.

Posted 6/28/2026 • Full-time • Paris, 🇫🇷 France • Junior • 💰 $150,000 - $250,000 per year

Tech Stack

Tools & technologies

Python

About the role

Key responsibilities & impact

Own and maintain our internal benchmark suite, covering single/multi-turn content guardrails and agentic safety.
Build benchmarks that distinguish specific model capabilities.
Work with the product team to build evals covering core functionality of our flagship models.
Build benchmarks for new features coming out of the research team.
Adapt and extend evals to new verticals and changing product data.
Work on research projects that study and quantify realistic agentic and LLM failure modes in the wild.

Requirements

What you’ll need

Have built an LLM benchmark from scratch that distinguished specific model capabilities (i.e., produced a measurable, defensible capability difference, not just a score).
Have built synthetic data for post-training textual or multimodal models.
Can reproduce a published benchmark result and identify where the original methodology is fragile or misleading.
You write Python that other people can build on. Our whole stack is Python; we want someone who has shipped and maintained production code and who factors messy problems into clean abstractions others can extend.
You can write efficient LLM inference setups, including sensible orchestration of parallel calls, retries, and rate-limit handling.
You are an AI power user, fluent with frontier models and coding agents day to day.
A big plus: automated red-teaming experience; work across a range of agentic scaffolds, including reproducing public benchmark results on them; strong knowledge of existing reward-model / monitoring / safety benchmarks; one or more published papers in the evals / safety-evaluation space.

Benefits

Comp & perks

Paid time off in line with your local regulations, no matter where you work from.
Comprehensive medical insurance for our France-based team
All the hardware, tools, and services you need
Covered subscriptions for AI agents and IDEs
Team off-sites twice a year: we’ve recently been to the Alps and to Saint-Tropez

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

LLM Benchmark Development, Synthetic Data Creation, Python Programming, Efficient LLM Inference Setups, Production Code Maintenance