Principal AI and ML Infra Software Engineer, GPU Clusters

NVIDIA

Principal Engineer enhancing AI/ML infrastructure efficiency for NVIDIA. Collaborating with research teams to optimize GPU Clusters and improve AI workflows.

Posted 4/28/2026 | Full-time | Santa Clara • California, Washington • 🇺🇸 United States | Lead | 💰 $272,000 - $431,250 per year

Tech Stack

Tools & technologies

AWS, Azure, Cloud, Docker, Go, Google Cloud Platform, Kubernetes, Python, PyTorch

About the role

Key responsibilities & impact

Engage closely with our AI and ML research teams to discern their infrastructure requirements and barriers
Convert those insights into actionable improvements
Proactively identify bottlenecks in researcher efficiency and lead initiatives to systematically remove them
Drive the direction and long-term roadmaps for such initiatives
Monitor and optimize the performance of our infrastructure, ensuring high availability, scalability, and efficient resource utilization
Help define and improve important measures of AI researcher efficiency
Work closely with a variety of teams, such as researchers, data engineers, and DevOps professionals
Keep up to date with the most recent developments in AI/ML technologies, frameworks, and successful strategies

Requirements

What you’ll need

BS or similar background in Computer Science or related area (or equivalent experience)
15+ years of demonstrated expertise in AI/ML and HPC tasks and systems
Hands-on experience in using or operating High Performance Computing (HPC) grade infrastructure
In-depth knowledge of accelerated computing (e.g., GPU, custom silicon), storage (e.g., Lustre, GPFS, BeeGFS), scheduling & orchestration (e.g., Slurm, Kubernetes, LSF)
High-speed networking experience (e.g., InfiniBand, RoCE, Amazon EFA)
Container technology expertise (Docker, Enroot)
Ability to oversee and improve large-scale distributed training workloads using PyTorch (DDP, FSDP), NeMo, or JAX
An in-depth understanding of AI/ML workflows, involving data processing, model training, and inference pipelines
Proficiency in programming & scripting languages such as Python, Go, Bash
Familiarity with cloud computing platforms (e.g., AWS, GCP, Azure)
Experience with parallel computing frameworks and paradigms
Dedication to ongoing learning and staying updated on new technologies and innovative methods in the AI/ML infrastructure sector
Excellent communication and collaboration skills

Benefits

Comp & perks

competitive salaries
comprehensive benefits package
equity

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

AI/ML, HPC, High Performance Computing, accelerated computing, GPU, custom silicon, storage, scheduling, orchestration, programming languages

Soft Skills

communication, collaboration, leadership, problem-solving, initiative, efficiency improvement, monitoring, optimization, proactive identification, dedication to learning

Certifications

BS in Computer Science, equivalent experience