Research Engineer, Evals

White Circle

White Circle, an AI safety company, is hiring a Research Engineer to own benchmarks for agent behavior and content safety. You will collaborate with the product team to evaluate core functionality and adapt benchmarks to new product features.

Posted 6/28/2026 • Full-time • Paris, 🇫🇷 France • Junior • 💰 $150,000 - $250,000 per year

Tech Stack

Tools & technologies
Python

About the role

Key responsibilities & impact
  • Own and maintain our internal benchmark suite, covering single/multi-turn content guardrails and agentic safety.
  • Build benchmarks that distinguish specific model capabilities.
  • Work with the product team to build evals covering core functionality of our flagship models.
  • Build benchmarks for new features coming out of the research team.
  • Adapt and extend evals to new verticals and changing product data.
  • Work on research projects that study and quantify realistic agentic and LLM failure modes in the wild.

Requirements

What you’ll need
  • You have built an LLM benchmark from scratch that distinguished specific model capabilities (i.e., it produced a measurable, defensible capability difference, not just a score).
  • You have built synthetic data for post-training text or multimodal models.
  • You can reproduce a published benchmark result and identify where the original methodology is fragile or misleading.
  • You write Python that other people can build on. Our whole stack is Python; we want someone who has shipped and maintained production code and who factors messy problems into clean abstractions others can extend.
  • You can write efficient LLM inference setups, including sensible orchestration of parallel calls, retries, and rate-limit handling.
  • You are an AI power user, fluent with frontier models and coding agents day to day.
  • A big plus: automated red-teaming experience; experience working across a range of agentic scaffolds and reproducing public benchmark results on them; strong knowledge of existing reward-model, monitoring, and safety benchmarks; one or more published papers in the evals / safety-evaluation space.

Benefits

Comp & perks
  • Paid time off in line with your local regulations, no matter where you work from.
  • Comprehensive medical insurance for our France-based team.
  • All the hardware, tools, and services you need.
  • Covered subscriptions for AI agents and IDEs.
  • Team off-sites twice a year: we’ve recently been to the Alps and to Saint-Tropez.

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
LLM Benchmark Development
Synthetic Data Creation
Python Programming
Efficient LLM Inference Setups
Production Code Maintenance