
Senior AI-HPC Cluster Engineer – MLOps
NVIDIA
full-time
Location Type: Remote
Location: California • Texas • United States
Salary
$184,000 - $356,500 per year
About the role
- Provide leadership and strategic mentorship on the management of large-scale HPC systems, including the deployment of compute, networking, and storage
- Develop and improve our ecosystem around GPU-accelerated computing including developing scalable automation solutions
- Build and nurture customer and cross-team relationships to consistently support the clusters and address changing user needs
- Support our researchers in running their workloads, including performance analysis and optimization
- Conduct root cause analysis and suggest corrective action
- Proactively identify and resolve potential issues before they affect users
- Build innovative tooling to accelerate researchers' velocity, troubleshooting, and software performance at scale
Requirements
- Bachelor’s degree in Computer Science, Electrical Engineering or related field or equivalent experience
- 8+ years of experience designing and operating large-scale compute infrastructure
- Experience with AI/HPC job schedulers and orchestrators, such as Slurm, K8s or LSF
- Applied experience with AI/HPC workflows that use MPI and NCCL
- Proficiency with Linux, including CentOS/RHEL and/or Ubuntu distributions
- A solid understanding of container technologies like Enroot, Docker and Podman
- Proficiency in one scripting language (Python, Bash) and at least one compiled language (Golang, Rust, C, C++...)
- Experience analyzing and tuning performance for a variety of AI/HPC workloads
- Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions
- Excellent communication and teamwork skills, with the ability to work effectively with diverse teams and individuals
- Passion for continual learning and staying ahead of new technologies and effective approaches in the HPC and AI/ML infrastructure fields
Benefits
- equity
- benefits