Staff Research Scientist – Reinforcement Learning

Thermo Fisher Scientific

Staff Research Scientist at Centific designing AI-driven simulation systems for enterprises and training LLM agents. Leading efforts in reinforcement learning and shaping technical direction for a talented team.

Posted 6/10/2026 • full-time • Remote • California • 🇺🇸 United States • Lead • 💰 $200,000 - $250,000 per year

ATS Keywords

Applicant Tracking System Keywords

Hard Skills

machine learning, artificial intelligence, reinforcement learning, reward engineering, policy optimization, LLM post-training, reward modeling, policy gradient methods, temporal difference learning, environment design

Soft Skills

mentoring, technical direction, influence, code review, engineering standards

Tools & Technologies

TRL, veRL, OpenRLHF, SkyRL, Gymnasium

Certifications & Qualifications

MS in Computer Science, PhD in Machine Learning

Industry Keywords

digital twins, simulation environments, multi-turn agents, closed learning loops, reward hacking

Tech Stack

Tools & technologies

Python

About the role

Key responsibilities & impact

Design simulation environments and digital twins for enterprise workflows
Post-train LLM agents using RLHF, DPO, GRPO, PPO, and emerging methods
Build pipelines that convert human-labeled traces and verifiable signals into training data
Architect multi-turn, tool-using agents with closed learning loops
Design reward functions and verifiers that resist reward hacking and reflect real task outcomes
Set the technical bar across the team — architecture, code review, engineering standards
Mentor researchers and engineers; drive technical direction through influence
Translate research into production; contribute to publications

Requirements

What you’ll need

7+ years in ML/AI research or engineering; 3+ years at senior/staff level
MS or PhD in Computer Science, Machine Learning, or related field (or equivalent)
5+ years hands-on RL (environment design, reward engineering, policy optimization), with at least one production deployment
LLM Post-Training
3+ years fine-tuning LLMs with hands-on RL post-training (RLHF, DPO, GRPO, PPO)
Expert-level implementation of RLHF pipelines, reward modeling (Bradley-Terry), DPO, and KTO
Working knowledge of modern post-training and rollout-serving libraries (TRL, veRL, OpenRLHF, SkyRL)
Experience building LLM-based agents: tool use, multi-turn reasoning, trajectory evaluation
Strong Python and software engineering skills — comfortable building production pipelines, not just notebooks
Deep expertise in MDPs, policy gradient methods (PPO, SAC), and temporal difference learning
Hands-on experience with Gymnasium-based environments and reward engineering (sparse vs. dense)

Benefits

Comp & perks

N/A