
Senior HPC Cluster Engineer
NVIDIA
full-time
Location Type: Office
Location: Santa Clara • California • Texas • United States
Salary
💰 $152,000 - $241,500 per year
About the role
- Develop and enhance our ecosystem around GPU-accelerated computing, including developing scalable automation solutions.
- Continuously improve infrastructure provisioning, management, observability, and day-to-day operations through automation.
- Provide technical leadership and strategic guidance for managing large-scale HPC systems, including the deployment of compute, networking, and storage.
- Foster strong customer and multi-functional partnerships to ensure consistent cluster support and rapidly adapt to evolving user needs.
- Support our researchers in running their EDA workloads, including performance analysis and optimization.
- Conduct root cause analysis and suggest corrective action.
- Proactively find and fix issues before they occur.
- Build innovative tooling to accelerate researchers' velocity, debugging and software performance at scale.
Requirements
- Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
- Minimum of 5 years of proven experience crafting and operating large-scale compute infrastructure, including cluster configuration management tools such as BCM or Ansible.
- Experience with AI/HPC job schedulers and orchestrators, such as Slurm, LSF, PBS or K8s.
- Applied experience with AI/HPC workflows that use MPI and NCCL.
- Proficiency with Linux, including Rocky/CentOS/RHEL and/or Ubuntu distributions.
- A solid understanding of container technologies such as Enroot and Docker.
- Proficiency in Python and Bash.
- Experience analyzing and tuning performance for a variety of EDA workloads.
- Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions.
- Excellent communication and collaboration skills, with the ability to work effectively with various teams and individuals.
- Passion for continual learning and staying ahead of new technologies and effective approaches in the HPC infrastructure field.
Benefits
- equity
- benefits