Senior Prompt and Benchmark Engineer, Evaluation of World Models

NVIDIA

Full-time

Location Type: Remote

Location: Remote • California • 🇺🇸 United States

Salary

💰 $184,000 - $356,500 per year

Job Level

Senior

About the role

  • Develop detailed, domain-specific benchmarks for evaluating world foundation models, especially world models for generation and understanding that reason about video, simulation, and physical environments.
  • Use sophisticated prompt engineering techniques to elicit structured, interpretable responses from a variety of foundation models.
  • Build, refine, and maintain question banks, multiple-choice formats, and test suites to support both automated and human evaluation workflows.
  • Employ multiple VLMs in parallel to explore ensemble evaluation methods such as majority voting, ranking agreement, and answer consensus (see the sketch after this list).
  • Make evaluation as automated and scalable as possible by encoding prompts and expected outputs into structured formats for downstream consumption.
  • Interface directly with Cosmos researchers to translate their evaluation needs into scalable test cases.
  • Collaborate with human annotators, providing clearly structured tasks, feedback loops, and quality control mechanisms to ensure dataset reliability.
  • Meet regularly with domain experts in robotics, autonomous vehicles, and simulation to understand their internal benchmarks, derive transferable metrics, and co-develop standardized evaluation formats.
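
The following is a minimal, illustrative sketch of the ensemble-evaluation and structured-output responsibilities above: a multiple-choice test case is encoded as a constrained prompt, several judge models answer it, and a majority vote is taken over the parsed answers. The query_model callable, model list, option-parsing rules, and field names are hypothetical placeholders, not part of this role description.

from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass
class TestCase:
    # A structured multiple-choice item: question text, labeled options,
    # and the expected answer key used for automated scoring.
    question: str
    options: dict[str, str]
    expected: str


def build_prompt(case: TestCase) -> str:
    # Encode the item as a constrained prompt so the reply can be parsed
    # mechanically as a single option letter.
    lines = [case.question, ""]
    lines += [f"{key}. {text}" for key, text in case.options.items()]
    lines.append("Answer with the single letter of the correct option.")
    return "\n".join(lines)


def parse_answer(reply: str, valid_keys: set[str]) -> str | None:
    # Accept the first valid option letter found in the reply; return None
    # when the model did not produce a parseable answer.
    for token in reply.strip().upper().split():
        letter = token.strip(".):")
        if letter in valid_keys:
            return letter
    return None


def evaluate(case: TestCase, models: list[str], query_model) -> dict:
    # Query every judge model with the same structured prompt, then apply
    # majority voting over the parsed answers.
    prompt = build_prompt(case)
    votes = []
    for model in models:
        reply = query_model(model, prompt)  # hypothetical inference-client call
        answer = parse_answer(reply, set(case.options))
        if answer is not None:
            votes.append(answer)
    tally = Counter(votes)
    consensus, _ = tally.most_common(1)[0] if tally else (None, 0)
    return {
        "consensus": consensus,
        "votes": dict(tally),
        "correct": consensus == case.expected,
    }

A production pipeline would add per-model weighting, ranking agreement, and logging, but the voting skeleton above stays the same.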

Requirements

  • 10+ years of experience in Machine Learning, NLP, Human-Computer Interaction, or related fields.
  • BS, MS, or equivalent background.
  • Familiarity with evaluating models via prompting, capturing structured outputs, and comparing across model families.
  • Strong attention to detail in designing natural language questions and formatting structured evaluations.
  • Proven ability to reason about model capabilities, failure modes, and blind spots in real-world generative model deployments.
  • Excellent communication and collaboration skills—you will regularly meet with researchers, annotators, and downstream users to iterate on benchmark design.
  • A working understanding of how VLMs and foundation models function at inference time, including token-level outputs, autoregressive decoding, and model context windows.

Benefits

  • Equity
  • Benefits

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
Machine Learning, Natural Language Processing, Human-Computer Interaction, Prompt Engineering, Model Evaluation, Ensemble Evaluation Methods, Structured Outputs, Token-Level Outputs, Autoregressive Decoding, Model Context Windows
Soft skills
Attention to Detail, Communication Skills, Collaboration Skills, Feedback Loops, Quality Control, Iterative Design
Certifications
BS, MS