
AI and ML HPC Cluster Engineer
NVIDIA
full-time
Location Type: Remote
Location: Remote • California, Colorado, Illinois, Texas, Washington • 🇺🇸 United States
Salary
💰 $120,000 - $189,750 per year
Job Level
Junior, Mid-Level
Tech Stack
Ansible, Cloud, Docker, Kubernetes, Linux, Node.js, Puppet, Python, SaltStack
About the role
- Support day-to-day operations of production on-premises and multi-cloud AI/HPC clusters, ensuring system health, user satisfaction, and efficient resource utilization.
- Directly administer internal research clusters, conduct upgrades, incident response, and reliability improvements.
- Develop and improve our ecosystem around GPU-accelerated computing including developing scalable automation solutions.
- Maintain heterogeneous AI/ML clusters on-premises and in the cloud.
- Support our researchers in running their workloads, including performance analysis and optimization.
- Analyze and optimize cluster efficiency, job fragmentation, and GPU waste to meet internal SLA targets.
- Support root cause analysis and suggest corrective action.
- Proactively find and fix issues before they occur.
- Triage and support postmortems for reliability incidents affecting users or infrastructure.
- Participate in a shared on-call rotation supported by strong automation, clear paths for responding to critical issues, and well-defined incident workflows.
Requirements
- Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience
- Minimum 2 years of experience administering multi-node compute infrastructure
- Background in managing AI/HPC job schedulers like Slurm, K8s, PBS, RTDA, BCM (formerly known as Bright), or LSF
- Proficient in administering CentOS/RHEL and/or Ubuntu Linux distributions
- Proven understanding of cluster configuration management tools (Ansible, Puppet, Salt, etc.), container technologies (Docker, Singularity, Podman, Shifter, Charliecloud), Python programming, and bash scripting.
- Passion for continual learning and staying ahead of emerging technologies and effective approaches in the HPC and AI/ML infrastructure fields.
Benefits
- Equity
- Benefits
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
GPU-accelerated computing, automation solutions, performance analysis, job fragmentation optimization, root cause analysis, Python programming, bash scripting, cluster configuration management, multi-node compute infrastructure, AI/HPC job schedulers
Soft skills
user satisfaction, efficient resource utilization, proactive issue resolution, incident response, collaboration, problem-solving, analytical thinking, communication, adaptability, attention to detail