Anthropic

Senior Research Scientist, Reward Models

full-time

Location Type: Hybrid

Location: San Francisco • California • 🇺🇸 United States

Salary

💰 $340,000 - $425,000 per year

Job Level

Senior

About the role

  • Lead research on novel reward model architectures and training approaches for RLHF
  • Develop and evaluate LLM-based grading and evaluation methods, including rubric-driven approaches that improve consistency and interpretability
  • Research techniques to detect, characterize, and mitigate reward hacking and specification gaming
  • Design experiments to understand reward model generalization, robustness, and failure modes
  • Collaborate with the Finetuning team to translate research insights into improvements for production training pipelines
  • Contribute to research publications, blog posts, and internal documentation
  • Mentor other researchers and help build institutional knowledge around reward modeling

Requirements

  • A track record of research contributions in reward modeling, RLHF, or closely related areas of machine learning
  • Experience training and evaluating reward models for large language models
  • Comfortable designing and running large-scale experiments with significant computational resources
  • Work effectively across research and engineering, iterating quickly while maintaining scientific rigor
  • Enjoy collaborative research and communicate complex ideas clearly to diverse audiences
  • Care deeply about building AI systems that are both highly capable and safe

Strong candidates may also have

  • Published research on reward modeling, preference learning, or RLHF
  • Experience with LLM-as-judge approaches, including calibration and reliability challenges
  • Worked on reward hacking, specification gaming, or related robustness problems
  • Experience with constitutional AI, debate, or other scalable oversight approaches
  • Contributed to production ML systems at scale
  • Familiarity with interpretability techniques as applied to understanding reward model behavior

Benefits

  • Competitive compensation and benefits
  • Optional equity donation matching
  • Generous vacation and parental leave
  • Flexible working hours

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills

  • reward modeling
  • RLHF
  • large language models
  • training models
  • evaluating models
  • designing experiments
  • interpretability techniques
  • calibration
  • reliability challenges
  • constitutional AI

Soft skills

  • collaborative research
  • communication
  • mentoring
  • scientific rigor
  • iterative development
  • building institutional knowledge
  • problem-solving
  • clear idea presentation
  • cross-functional teamwork
  • passion for AI safety