NVIDIA

Senior AI and ML HPC Cluster Engineer

full-time

Location Type: Remote

Location: Remote • California, Colorado, Illinois, Texas, Washington • 🇺🇸 United States

Salary

💰 $136,000 - $264,500 per year

Job Level

Senior

Tech Stack

Ansible, Cloud, Docker, Kubernetes, Linux, Puppet, Python, SaltStack

About the role

  • Provide leadership and strategic guidance on the management of large-scale HPC systems including the deployment of compute, networking, and storage.
  • Develop and improve our ecosystem around GPU-accelerated computing including developing scalable automation solutions
  • Build and maintain AI and ML heterogeneous clusters on-premises and in the cloud
  • Create and cultivate customer and cross-team relationships to reliably sustain the clusters and meet users' evolving needs
  • Support our researchers to run their workloads including performance analysis and optimizations
  • Conduct root cause analysis and suggest corrective action
  • Proactively find and fix potential issues before they occur

Requirements

  • Bachelor’s degree in Computer Science, Electrical Engineering or related field or equivalent experience
  • 5+ years of experience designing and operating large-scale compute infrastructure
  • Experience with AI/HPC advanced job schedulers, such as Slurm, K8s, PBS, RTDA or LSF
  • Proficient in administering CentOS/RHEL and/or Ubuntu Linux distributions
  • Solid understanding of cluster configuration management tools such as Ansible, Puppet, Salt
  • In-depth understanding of container technologies like Docker, Singularity, Podman, Shifter, Charliecloud
  • Proficiency in Python programming and bash scripting
  • Applied experience with AI/HPC workflows that use MPI
  • Experience analyzing and tuning performance for a variety of AI/HPC workloads.
  • Passion for continual learning and staying ahead of emerging technologies and effective approaches in the HPC and AI/ML infrastructure fields.

Benefits

  • equity
  • benefits
