
AI and ML Infra Software Engineer, GPU Clusters
NVIDIA
full-time
Location Type: Office
Location: Santa Clara • California, Washington • 🇺🇸 United States
Salary
💰 $120,000 - $189,750 per year
Job Level
Junior
Tech Stack
AWS, Azure, Cloud, Docker, Go, Google Cloud Platform, Kubernetes, Python, PyTorch
About the role
- Collaborate closely with our AI and ML research teams to understand their infrastructure needs and obstacles
- Monitor and optimize the performance of our infrastructure, ensuring high availability, scalability, and efficient resource utilization
- Help define and improve important measures of AI researcher efficiency
- Collaborate with diverse teams, including researchers, data engineers, and DevOps professionals
- Stay on top of the latest advancements in AI/ML technologies, frameworks, and effective strategies
Requirements
- BS or equivalent experience in Computer Science or a related field
- Proven experience in AI/ML and HPC workloads and infrastructure
- Hands-on experience in using or operating High Performance Computing (HPC) grade infrastructure
- In-depth knowledge of accelerated computing (e.g., GPU, custom silicon)
- Storage (e.g., Lustre, GPFS, BeeGFS)
- Scheduling & orchestration (e.g., Slurm, Kubernetes, LSF)
- High-speed networking (e.g., InfiniBand, RoCE, Amazon EFA)
- Container technologies (e.g., Docker, Enroot)
- Expertise in running and optimizing large-scale distributed training workloads using PyTorch (DDP, FSDP), NeMo, or JAX
- Deep understanding of AI/ML workflows, encompassing data processing, model training, and inference pipelines
- Proficiency in programming & scripting languages such as Python, Go, Bash
- Familiarity with cloud computing platforms (e.g., AWS, GCP, Azure)
- Experience with parallel computing frameworks and paradigms
- Passion for continual learning
- Excellent communication and collaboration skills
Benefits
- Health insurance
- Retirement plans
- Paid time off
- Flexible work arrangements
- Professional development opportunities
- Bonuses
- Stock options
- Equipment allowances
- Wellness programs
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
AI/ML, HPC, High Performance Computing, accelerated computing, GPU, custom silicon, storage, scheduling, orchestration, programming
Soft skills
collaboration, communication, passion for learning