NVIDIA

Senior HPC Cluster Engineer

full-time

Location Type: Office

Location: Santa Clara, California; Texas; United States

Salary

$152,000 - $241,500 per year

About the role

  • Develop and enhance our ecosystem around GPU-accelerated computing including developing scalable automation solutions.
  • Continuously improve infrastructure provisioning, management, observability, and day-to-day operations through automation.
  • Provide technical leadership and strategic guidance for managing large-scale HPC systems, including the deployment of compute, networking, and storage.
  • Foster strong customer and multi-functional partnerships to ensure consistent cluster support and rapidly adapt to evolving user needs.
  • Support our researchers in running their EDA workloads, including performance analysis and optimization.
  • Conduct root cause analysis and suggest corrective action.
  • Proactively identify and address potential issues before they occur.
  • Build innovative tooling to accelerate researchers' velocity, debugging and software performance at scale.

Requirements

  • Bachelor’s degree in Computer Science, Electrical Engineering or related field or equivalent experience.
  • Minimum of 5 years of proven experience crafting and operating large-scale compute infrastructure, including cluster configuration management tools such as BCM or Ansible.
  • Experience with AI/HPC job schedulers and orchestrators, such as Slurm, LSF, PBS or K8s.
  • Applied experience with AI/HPC workflows that use MPI and NCCL.
  • Proficiency with Linux, including Rocky/CentOS/RHEL and/or Ubuntu distributions.
  • A solid understanding of container technologies such as Enroot and Docker.
  • Proficiency in Python and Bash.
  • Experience analyzing and tuning performance for a variety of EDA workloads.
  • Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions.
  • Excellent communication and collaboration skills, with the ability to work effectively with various teams and individuals.
  • Passion for continual learning and staying ahead of new technologies and effective approaches in the HPC infrastructure fields.
Benefits
  • equity
  • benefits