NVIDIA

Senior DGX Cloud Performance Engineer

full-time

Location Type: Remote

Location: California, Texas, United States

Salary

💰 $152,000 - $287,500 per year

About the role

  • Develop benchmarks and end-to-end customer applications that run at scale and are instrumented for performance measurement, tracking, and sampling, in order to measure and optimize the performance of important applications and services.
  • Construct carefully designed experiments to analyze, study, and develop critical insights into performance bottlenecks and dependencies from an end-to-end perspective.
  • Develop ideas on how to improve end-to-end system performance and usability by driving changes in the HW or SW (or both).
  • Collaborate with AI researchers, developers, and application service providers to understand internal developer and external customer pain points and requirements, project future needs, and share best practices.
  • Develop the necessary modeling framework and TCO (total cost of ownership) analysis to enable efficient exploration and sweep of the architecture and design space.
  • Develop the methodology needed to drive the engineering analysis that informs the architecture, design, and roadmap of DGX Cloud.

Requirements

  • Expertise in working with large-scale parallel and distributed accelerator-based systems
  • Expertise in optimizing performance and AI workloads on large-scale systems
  • Experience with performance modeling and benchmarking at scale
  • Strong background in Computer Architecture, Networking, Storage systems, Accelerators
  • Familiarity with popular AI frameworks (PyTorch, TensorFlow, JAX, Megatron-LM, TensorRT-LLM, vLLM, among others)
  • Experience with AI/ML models and workloads, in particular LLMs, as well as an understanding of DNNs and their use in emerging AI/ML applications and services
  • Bachelor's/Master's in Engineering or equivalent experience (preferably Electrical Engineering, Computer Engineering, or Computer Science)
  • 5+ years of experience in the above areas
  • Proficiency in Python, C/C++
  • Expertise with at least one public CSP infrastructure (GCP, AWS, Azure, OCI, …)
Benefits
  • equity
  • benefits
Applicant Tracking System Keywords

Hard Skills & Tools
performance modeling, benchmarking, AI workloads optimization, computer architecture, networking, storage systems, accelerators, Python, C/C++, AI frameworks
Soft Skills
collaboration, problem-solving, analytical thinking, communication