NVIDIA

Principal AI and ML Infra Software Engineer, GPU Clusters

Principal Engineer role focused on improving the efficiency of NVIDIA's AI/ML infrastructure, collaborating with research teams to optimize GPU clusters and improve AI workflows.

Posted 4/28/2026 • Full-time • Santa Clara • California, Washington • 🇺🇸 United States • Lead • 💰 $272,000 - $431,250 per year

Tech Stack

Tools & technologies
AWS, Azure, Cloud, Docker, Go, Google Cloud Platform, Kubernetes, Python, PyTorch

About the role

Key responsibilities & impact
  • Engage closely with our AI and ML research teams to discern their infrastructure requirements and barriers
  • Convert those insights into actionable improvements
  • Proactively identify researcher efficiency bottlenecks and lead initiatives to systematically resolve them
  • Drive the direction and long-term roadmaps for such initiatives
  • Monitor and optimize the performance of our infrastructure, ensuring high availability, scalability, and efficient resource utilization
  • Help define and improve important measures of AI researcher efficiency
  • Work closely with a variety of teams, such as researchers, data engineers, and DevOps professionals
  • Keep up to date with the most recent developments in AI/ML technologies, frameworks, and best practices

Requirements

What you’ll need
  • BS or similar background in Computer Science or related area (or equivalent experience)
  • 15+ years of demonstrated expertise in AI/ML and HPC tasks and systems
  • Hands-on experience in using or operating High Performance Computing (HPC) grade infrastructure
  • In-depth knowledge of accelerated computing (e.g., GPU, custom silicon), storage (e.g., Lustre, GPFS, BeeGFS), scheduling & orchestration (e.g., Slurm, Kubernetes, LSF)
  • High-speed networking experience (e.g., InfiniBand, RoCE, Amazon EFA)
  • Container technology expertise (Docker, Enroot)
  • Ability to oversee and improve large-scale distributed training operations using PyTorch (DDP, FSDP), NeMo, or JAX
  • An in-depth understanding of AI/ML workflows, including data processing, model training, and inference pipelines
  • Proficiency in programming & scripting languages such as Python, Go, Bash
  • Familiarity with cloud computing platforms (e.g., AWS, GCP, Azure)
  • Experience with parallel computing frameworks and paradigms
  • Dedication to ongoing learning and staying updated on new technologies and innovative methods in the AI/ML infrastructure sector
  • Excellent communication and collaboration skills

Benefits

Comp & perks
  • Competitive salaries
  • Comprehensive benefits package
  • Equity

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
AI/ML, HPC, High Performance Computing, accelerated computing, GPU, custom silicon, storage, scheduling, orchestration, programming languages
Soft Skills
communication, collaboration, leadership, problem-solving, initiative, efficiency improvement, monitoring, optimization, proactive identification, dedication to learning
Certifications
BS in Computer Science, equivalent experience