
Principal AI and ML Infra Software Engineer, GPU Clusters
NVIDIA
Principal Engineer enhancing AI/ML infrastructure efficiency for NVIDIA. Collaborating with research teams to optimize GPU clusters and improve AI workflows.
Posted 4/28/2026 • full-time • Santa Clara • California, Washington • 🇺🇸 United States • Lead • 💰 $272,000 - $431,250 per year
Tech Stack
Tools & technologies: AWS, Azure, Cloud, Docker, Go, Google Cloud Platform, Kubernetes, Python, PyTorch
About the role
Key responsibilities & impact
- Engage closely with our AI and ML research teams to discern their infrastructure requirements and barriers
- Convert those insights into actionable improvements
- Proactively identify bottlenecks in researcher efficiency and lead initiatives to systematically address them
- Drive the direction and long-term roadmaps for such initiatives
- Monitor and optimize the performance of our infrastructure, ensuring high availability, scalability, and efficient resource utilization
- Help define and improve important measures of AI researcher efficiency
- Work closely with a variety of teams, such as researchers, data engineers, and DevOps professionals
- Keep up to date with the most recent developments in AI/ML technologies, frameworks, and successful strategies
Requirements
What you’ll need
- BS or similar background in Computer Science or related area (or equivalent experience)
- 15+ years of demonstrated expertise in AI/ML and HPC tasks and systems
- Hands-on experience in using or operating High Performance Computing (HPC) grade infrastructure
- In-depth knowledge of accelerated computing (e.g., GPU, custom silicon), storage (e.g., Lustre, GPFS, BeeGFS), scheduling & orchestration (e.g., Slurm, Kubernetes, LSF)
- High-speed networking experience (e.g., Infiniband, RoCE, Amazon EFA)
- Container technology expertise (Docker, Enroot)
- Capability in supervising and improving substantial distributed training operations using PyTorch (DDP, FSDP), NeMo, or JAX
- An in-depth understanding of AI/ML workflows, involving data processing, model training, and inference pipelines
- Proficiency in programming & scripting languages such as Python, Go, Bash
- Familiarity with cloud computing platforms (e.g., AWS, GCP, Azure)
- Experience with parallel computing frameworks and paradigms
- Dedication to ongoing learning and staying updated on new technologies and innovative methods in the AI/ML infrastructure sector
- Excellent communication and collaboration skills
Benefits
Comp & perks
- competitive salaries
- comprehensive benefits package
- equity
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
AI/ML, HPC, High Performance Computing, accelerated computing, GPU, custom silicon, storage, scheduling, orchestration, programming languages
Soft Skills
communication, collaboration, leadership, problem-solving, initiative, efficiency improvement, monitoring, optimization, proactive identification, dedication to learning
Certifications
BS in Computer Science, equivalent experience