Salary
💰 $255,000 - $325,000 per year
About the role
- Define the core evaluation signals that drive model improvement at OpenAI, turning vague product gaps into crisp, defensible measures of quality
- Design agents, harnesses, and eval pipelines that are reliable, reproducible, and extensible
- Prototype solutions on real workflows and convert what works into scalable feedback loops
- Connect evaluation signals directly to research and training systems so product improvements show up in what users experience
- Evaluate multi-turn and tool-using systems, design agent harnesses, and apply reinforcement learning and related methods in production settings
- Collaborate closely with research and product teams and work across the stack, from backend pipelines to user-facing interfaces
- Build reusable systems and tools that enable contributions from across the company and steadily raise the quality bar
- Operate like a founder or founding engineer: take initiative, move quickly, and create structure where none exists
Requirements
- 4+ years of experience in software engineering with strong fundamentals and a track record of shipping production systems end-to-end
- Experience building AI agents or applications, including designing evals and improving performance through prompting or scaffolding
- Familiarity with evaluation methods for LLMs and patterns like multi-agent workflows, tool use, or long context
- Familiarity with deep learning concepts or prior exposure to training models
- Strong communication skills with both technical and non-technical audiences
- Motivation to collaborate with research and product teams on high-impact work, and the ability to thrive in ambiguity
- Ability to work from our San Francisco office three days per week (hybrid)
- Willingness to prototype with users and build reusable systems and tools