
Machine Learning Scientist
Arena
full-time
Location Type: Hybrid
Location: Bay Area • California • United States
About the role
- Design and conduct experiments to evaluate AI model behavior across reasoning, style, robustness, and user preference dimensions.
- Develop new metrics, methodologies, and evaluation protocols that go beyond traditional benchmarks.
- Analyze large-scale human voting and interaction data to uncover insights into model performance and user preferences.
- Collaborate with engineers to implement and scale research findings into production systems.
- Prototype and test research ideas rapidly, balancing rigor with iteration speed.
- Author internal reports and external publications that contribute to the broader ML research community.
- Partner with model providers to shape evaluation questions and support responsible model testing.
- Contribute to the scientific integrity and transparency of the LMArena leaderboard and tools.
Requirements
- Hands-on experience training large-scale models, including reward models, preference models, and fine-tuning LLMs with methods like RLHF, DPO, and contrastive learning.
- Strong foundation in ML and statistics, with a track record of designing novel training objectives, evaluation schemes, or statistical frameworks to improve model reliability and alignment.
- Fluent in the full experimental stack, from dataset design and large-batch training to rigorous evaluation and ablation, with an eye for what scales to production.
- Deeply collaborative mindset, working closely with engineers to productionize research insights and iterating with product teams to align modeling goals with user needs.
- PhD or equivalent research experience in Machine Learning, Natural Language Processing, Statistics, or a related field.
- Strong understanding of LLMs and of modern deep learning architectures and training methods (e.g., Transformers, diffusion models, reinforcement learning from human feedback).
- Proficiency in Python and ML research libraries such as PyTorch, JAX, or TensorFlow.
- Demonstrated ability to design and analyze experiments with statistical rigor.
- Experience publishing research or working on open-source projects in ML, NLP, or AI evaluation.
- Comfortable working with real-world usage data and designing metrics beyond standard benchmarks.
- Ability to translate research questions into practical systems and collaborate across engineering and product teams.
- Passion for open science, reproducibility, and community-driven research.
Benefits
- Comprehensive health and wellness benefits, including medical, dental, vision, and additional support programs.
- Competitive compensation and equity aligned to the markets where our team members are based.
- The opportunity to work on cutting-edge AI with a small, mission-driven team.
- A culture that values transparency, trust, and community impact.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
training large-scale models, reward models, preference models, fine-tuning LLMs, RLHF, DPO, contrastive learning, experimental design, statistical frameworks, deep learning architectures
Soft Skills
collaborative mindset, iterative development, communication, problem-solving, passion for open science, translating research questions, working closely with engineers, aligning modeling goals, community-driven research, designing metrics
Certifications
PhD in Machine Learning, PhD in Natural Language Processing, PhD in Statistics