
Senior Prompt and Benchmark Engineer, Evaluation of World Models
NVIDIA
Full-time
Location Type: Remote
Location: Remote • California • 🇺🇸 United States
Salary
$184,000 - $356,500 per year
Job Level
Senior
About the role
- Develop detailed, domain-specific benchmarks for evaluating world foundation models, especially generation and understanding models that reason over video, simulation, and physical environments.
- Use sophisticated prompt engineering techniques to elicit structured, interpretable responses from a variety of foundation models.
- Build, refine, and maintain question banks, multiple-choice formats, and test suites to support both automated and human evaluation workflows.
- Employ multiple VLMs in parallel to explore ensemble evaluation methods such as majority voting, ranking agreement, and answer consensus.
- Make evaluation as automated and scalable as possible by encoding prompts and expected outputs into structured formats for downstream consumption.
- Interface directly with Cosmos researchers to translate their evaluation needs into scalable test cases.
- Collaborate with human annotators, providing clearly structured tasks, feedback loops, and quality control mechanisms to ensure dataset reliability.
- Meet regularly with domain experts in robotics, autonomous vehicles, and simulation to understand their internal benchmarks, derive transferable metrics, and co-develop standardized evaluation formats.
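The ensemble evaluation idea above (running multiple VLM judges in parallel and taking a majority vote) can be sketched as follows. This is a minimal illustration, not NVIDIA's implementation; the function name and answer format are assumptions.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer among several VLM judges,
    plus the fraction of judges that agreed with it.

    `answers` is a list of multiple-choice letters, one per model
    (hypothetical format, for illustration only).
    """
    counts = Counter(answers)
    top, support = counts.most_common(1)[0]
    return top, support / len(answers)

# Three hypothetical VLM judges answering the same question:
choice, agreement = majority_vote(["B", "B", "C"])
# choice is "B" with 2/3 agreement
```

In practice the agreement fraction is as useful as the winning answer: low-consensus items are natural candidates for routing to human annotators.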
Requirements
- 10+ years of experience in Machine Learning, NLP, Human-Computer Interaction, or related fields.
- BS or MS degree, or equivalent background.
- Familiarity with evaluating models via prompting, capturing structured outputs, and comparing across model families.
- Strong attention to detail in designing natural language questions and formatting structured evaluations.
- Proven ability to reason about model capabilities, failure modes, and blind spots in real-world generative model deployments.
- Excellent communication and collaboration skills; you will regularly meet with researchers, annotators, and downstream users to iterate on benchmark design.
- A working understanding of how VLMs and foundation models function at inference time, including token-level outputs, autoregressive decoding, and model context windows.
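"Capturing structured outputs" and "encoding prompts and expected outputs into structured formats" typically mean representing each benchmark item as data that an automated harness can score without human parsing. A minimal sketch, with all field names and the scoring rule as illustrative assumptions rather than any actual NVIDIA schema:

```python
import json

# One benchmark item encoded as structured data: the prompt, the
# multiple-choice options, and the gold answer live together so a
# harness can render the question and grade the response.
item = {
    "id": "physics-0042",
    "prompt": "A ball rolls off a table in the video. How does it fall?",
    "choices": {"A": "Straight down", "B": "In a parabolic arc", "C": "Upward"},
    "answer": "B",
}

def score(item, model_output):
    # Normalize the model's raw choice letter before comparing
    # against the gold answer (lenient on whitespace and case).
    return model_output.strip().upper() == item["answer"]

serialized = json.dumps(item)  # stored alongside the rest of the test suite
```

Keeping items in a format like this is what makes the "automated and scalable" evaluation in the role description possible: the same file drives both the model-facing harness and the annotator-facing review tooling.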
Benefits
- equity
- benefits