NVIDIA

Senior AI-HPC Cluster Engineer – MLOps

full-time

Location Type: Remote

Location: California, Texas, United States

Salary

💰 $184,000 - $356,500 per year

About the role

  • Provide leadership and strategic mentorship on the management of large-scale HPC systems, including the deployment of compute, networking, and storage
  • Develop and improve our ecosystem around GPU-accelerated computing, including scalable automation solutions
  • Build and nurture customer and cross-team relationships to consistently support the clusters and address changing user needs
  • Support our researchers in running their workloads, including performance analysis and optimization
  • Conduct root cause analysis and suggest corrective action
  • Proactively identify and resolve potential issues before they affect users
  • Build innovative tooling to accelerate researchers' velocity, troubleshooting, and software performance at scale

Requirements

  • Bachelor’s degree in Computer Science, Electrical Engineering or related field or equivalent experience
  • 8+ years of experience crafting and operating large-scale compute infrastructure
  • Experience with AI/HPC job schedulers and orchestrators, such as Slurm, K8s or LSF
  • Applied experience with AI/HPC workflows that use MPI and NCCL
  • Proficiency with Linux, including CentOS/RHEL and/or Ubuntu distributions
  • A solid understanding of container technologies like Enroot, Docker and Podman
  • Proficiency in one scripting language (Python, Bash) and at least one compiled language (Golang, Rust, C, C++...)
  • Experience analyzing and tuning performance for a variety of AI/HPC workloads
  • Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions
  • Excellent communication and teamwork skills, with the ability to work effectively with diverse teams and individuals
  • Passion for continual learning and staying ahead of new technologies and effective approaches in the HPC and AI/ML infrastructure fields

Benefits

  • equity
  • benefits

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
HPC systems management, GPU-accelerated computing, performance analysis, root cause analysis, AI/HPC job schedulers, MPI, NCCL, Linux, container technologies, scripting languages
Soft Skills
leadership, strategic mentorship, customer relationship management, problem-solving, communication, teamwork, adaptability, innovation, collaboration, passion for learning