
Senior DGX Cloud Performance Engineer
NVIDIA
full-time
Location Type: Remote
Location: California • Texas • United States
Salary
💰 $152,000 - $287,500 per year
About the role
- Develop benchmarks and end-to-end customer applications that run at scale and are instrumented for performance measurement, tracking, and sampling, in order to measure and optimize the performance of important applications and services.
- Construct carefully designed experiments to analyze, study, and develop critical insights into performance bottlenecks and dependencies from an end-to-end perspective.
- Develop ideas on how to improve end-to-end system performance and usability by driving changes in the HW, the SW, or both.
- Collaborate with AI researchers, developers, and application service providers to understand internal developer and external customer pain points and requirements, project future needs, and share best practices.
- Develop the necessary modeling framework and TCO (total cost of ownership) analysis to enable efficient exploration and sweeps of the architecture and design space.
- Develop the methodology needed to drive the engineering analysis that informs the architecture, design, and roadmap of DGX Cloud.
Requirements
- Expertise in working with large-scale parallel and distributed accelerator-based systems
- Expertise in optimizing performance and AI workloads on large-scale systems
- Experience with performance modeling and benchmarking at scale
- Strong background in Computer Architecture, Networking, Storage systems, Accelerators
- Familiarity with popular AI frameworks (PyTorch, TensorFlow, JAX, Megatron-LM, TensorRT-LLM, and vLLM, among others)
- Experience with AI/ML models and workloads, in particular LLMs, as well as an understanding of DNNs and their use in emerging AI/ML applications and services
- Bachelor's or Master's degree in Engineering or equivalent experience (preferably Electrical Engineering, Computer Engineering, or Computer Science)
- 5+ years of experience in the above areas
- Proficiency in Python and C/C++
- Expertise with at least one public CSP infrastructure (GCP, AWS, Azure, OCI, …)
Benefits
- equity
- benefits
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
performance modeling, benchmarking, AI workloads optimization, computer architecture, networking, storage systems, accelerators, Python, C/C++, AI frameworks
Soft Skills
collaboration, problem-solving, analytical thinking, communication