NVIDIA

AI and ML HPC Cluster Engineer

full-time

Location Type: Remote

Location: Remote • California, Colorado, Illinois, Texas, Washington • 🇺🇸 United States

Salary

💰 $120,000 - $189,750 per year

Job Level

Junior, Mid-Level

Tech Stack

Ansible, Cloud, Docker, Kubernetes, Linux, Node.js, Puppet, Python, SaltStack

About the role

  • Support day-to-day operations of production on-premises and multi-cloud AI/HPC clusters, ensuring system health, user satisfaction, and efficient resource utilization.
  • Directly administer internal research clusters, including upgrades, incident response, and reliability improvements.
  • Develop and improve our ecosystem around GPU-accelerated computing including developing scalable automation solutions.
  • Maintain heterogeneous AI/ML clusters on-premises and in the cloud.
  • Support our researchers in running their workloads, including performance analysis and optimization.
  • Analyze and optimize cluster efficiency, job fragmentation, and GPU waste to meet internal SLA targets.
  • Support root cause analysis and suggest corrective action.
  • Proactively find and fix issues before they occur.
  • Triage and support postmortems for reliability incidents affecting users or infrastructure.
  • Participate in a shared on-call rotation supported by strong automation, clear paths for responding to critical issues, and well-defined incident workflows.

Requirements

  • Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
  • Minimum 2 years of experience administering multi-node compute infrastructure.
  • Background in managing AI/HPC job schedulers like Slurm, K8s, PBS, RTDA, BCM (formerly known as Bright), or LSF.
  • Proficient in administering CentOS/RHEL and/or Ubuntu Linux distributions.
  • Proven understanding of cluster configuration management tools (Ansible, Puppet, Salt, etc.), container technologies (Docker, Singularity, Podman, Shifter, Charliecloud), Python programming, and bash scripting.
  • Passion for continual learning and staying ahead of emerging technologies and effective approaches in the HPC and AI/ML infrastructure fields.

Benefits

  • Equity
  • Benefits

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
GPU-accelerated computing, automation solutions, performance analysis, job fragmentation optimization, root cause analysis, Python programming, bash scripting, cluster configuration management, multi-node compute infrastructure, AI/HPC job schedulers
Soft skills
user satisfaction, efficient resource utilization, proactive issue resolution, incident response, collaboration, problem-solving, analytical thinking, communication, adaptability, attention to detail