NVIDIA

AI and ML Infra Software Engineer, GPU Clusters

full-time

Location Type: Office

Location: Santa Clara • California, Washington • 🇺🇸 United States

Salary

💰 $120,000 - $189,750 per year

Job Level

Junior

Tech Stack

AWS, Azure, Cloud, Docker, Go, Google Cloud Platform, Kubernetes, Python, PyTorch

About the role

  • Collaborate closely with our AI and ML research teams to understand their infrastructure needs and obstacles
  • Monitor and optimize the performance of our infrastructure, ensuring high availability, scalability, and efficient resource utilization
  • Help define and improve important measures of AI researcher efficiency
  • Collaborate with diverse teams, including researchers, data engineers, and DevOps professionals
  • Stay on top of the latest advancements in AI/ML technologies, frameworks, and effective strategies

Requirements

  • BS or equivalent experience in Computer Science or related field
  • Proven experience with AI/ML and HPC workloads and infrastructure
  • Hands-on experience using or operating High Performance Computing (HPC) grade infrastructure
  • In-depth knowledge of accelerated computing (e.g., GPU, custom silicon), storage (e.g., Lustre, GPFS, BeeGFS), scheduling and orchestration (e.g., Slurm, Kubernetes, LSF), high-speed networking (e.g., InfiniBand, RoCE, Amazon EFA), and container technologies (Docker, Enroot)
  • Expertise in running and optimizing large-scale distributed training workloads using PyTorch (DDP, FSDP), NeMo, or JAX
  • Deep understanding of AI/ML workflows, encompassing data processing, model training, and inference pipelines
  • Proficiency in programming & scripting languages such as Python, Go, Bash
  • Familiarity with cloud computing platforms (e.g., AWS, GCP, Azure)
  • Experience with parallel computing frameworks and paradigms
  • Passion for continual learning
  • Excellent communication and collaboration skills

Benefits

  • Health insurance
  • Retirement plans
  • Paid time off
  • Flexible work arrangements
  • Professional development opportunities
  • Bonuses
  • Stock options
  • Equipment allowances
  • Wellness programs

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
AI/ML, HPC, High Performance Computing, accelerated computing, GPU, custom silicon, storage, scheduling, orchestration, programming
Soft skills
collaboration, communication, passion for learning