
Senior Prompt and Benchmark Engineer, Evaluation of World Models
NVIDIA
Full-time
Location Type: Remote
Location: Remote • California • 🇺🇸 United States
Salary
$184,000 - $356,500 per year
Job Level
Senior
About the role
- Develop detailed, domain-specific benchmarks for evaluating world foundation models, especially generation and understanding models that reason over video, simulation, and physical environments.
- Use sophisticated prompt engineering techniques to elicit structured, interpretable responses from a variety of foundation models.
- Build, refine, and maintain question banks, multiple-choice formats, and test suites to support both automated and human evaluation workflows.
- Employ multiple VLMs in parallel to explore ensemble evaluation methods such as majority voting, ranking agreement, and answer consensus.
- Make evaluation as automated and scalable as possible by encoding prompts and expected outputs into structured formats for downstream consumption.
- Interface directly with Cosmos researchers to translate their evaluation needs into scalable test cases.
- Collaborate with human annotators, providing clearly structured tasks, feedback loops, and quality control mechanisms to ensure dataset reliability.
- Meet regularly with domain experts in robotics, autonomous vehicles, and simulation to understand their internal benchmarks, derive transferable metrics, and co-develop standardized evaluation formats.
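The ensemble evaluation idea above (running multiple VLM judges in parallel and taking a majority vote) can be sketched as follows. This is a minimal illustration, not NVIDIA's implementation; the function name and answer format are assumptions.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer among several VLM judges,
    plus the fraction of judges that agreed with it.

    `answers` is a list of multiple-choice letters, one per model
    (hypothetical format, for illustration only).
    """
    counts = Counter(answers)
    top, support = counts.most_common(1)[0]
    return top, support / len(answers)

# Three hypothetical VLM judges answering the same question:
choice, agreement = majority_vote(["B", "B", "C"])
# choice is "B" with 2/3 agreement
```

In practice the agreement fraction is as useful as the winning answer: low-consensus items are natural candidates for routing to human annotators.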
Requirements
- 10+ years of experience in Machine Learning, NLP, Human-Computer Interaction, or related fields.
- BS or MS degree, or equivalent background.
- Familiarity with evaluating models via prompting, capturing structured outputs, and comparing across model families.
- Strong attention to detail in designing natural language questions and formatting structured evaluations.
- Proven ability to reason about model capabilities, failure modes, and blind spots in real-world generative model deployments.
- Excellent communication and collaboration skills; you will regularly meet with researchers, annotators, and downstream users to iterate on benchmark design.
- A working understanding of how VLMs and foundation models function at inference time, including token-level outputs, autoregressive decoding, and model context windows.
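"Capturing structured outputs" and "encoding prompts and expected outputs into structured formats" typically mean representing each benchmark item as data that an automated harness can score without human parsing. A minimal sketch, with all field names and the scoring rule as illustrative assumptions rather than any actual NVIDIA schema:

```python
import json

# One benchmark item encoded as structured data: the prompt, the
# multiple-choice options, and the gold answer live together so a
# harness can render the question and grade the response.
item = {
    "id": "physics-0042",
    "prompt": "A ball rolls off a table in the video. How does it fall?",
    "choices": {"A": "Straight down", "B": "In a parabolic arc", "C": "Upward"},
    "answer": "B",
}

def score(item, model_output):
    # Normalize the model's raw choice letter before comparing
    # against the gold answer (lenient on whitespace and case).
    return model_output.strip().upper() == item["answer"]

serialized = json.dumps(item)  # stored alongside the rest of the test suite
```

Keeping items in a format like this is what makes the "automated and scalable" evaluation in the role description possible: the same file drives both the model-facing harness and the annotator-facing review tooling.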
Benefits
- equity
- benefits