Salary
💰 $184,000 - $356,500 per year
Tech Stack
AWS, Azure, Cloud, Distributed Systems, Go, Google Cloud Platform, Kubernetes, Python
About the role
- Design, develop, and operate distributed systems that manage data, compute, and networking for large-scale AI workloads.
- Build software and automation to orchestrate workloads across thousands of GPUs and petabytes of storage in multi-region clusters.
- Collaborate with AI/ML research teams to understand their requirements and translate them into scalable, high-performance solutions.
- Drive improvements in system reliability, performance, and observability to meet exascale standards.
- Partner with security, networking, and platform teams to ensure that MARS infrastructure meets the highest standards of robustness and compliance.
- Participate in design reviews, contribute to system architecture discussions, and influence the evolution of NVIDIA’s AI infrastructure stack.
- Stay current with advances in distributed systems, large-scale computing, and AI frameworks to help shape the future direction of MARS.
Requirements
- BS or equivalent experience in Computer Science, Computer Engineering, or a related technical field.
- 8+ years of experience developing and operating large-scale distributed systems, infrastructure platforms, or HPC environments.
- Strong programming skills in C++, Python, or Go, with proven experience designing production-quality software systems.
- Solid understanding of distributed systems principles, data management, and large-scale orchestration frameworks.
- Hands-on experience with high-performance storage (e.g., Lustre, GPFS, BeeGFS) and compute scheduling and orchestration (e.g., Slurm, Kubernetes, LSF).
- Familiarity with cloud environments (Azure, AWS, GCP) and infrastructure automation tools.
- Strong problem-solving skills, ownership mindset, and the ability to thrive in a fast-paced, collaborative environment.
- Excellent communication skills and a track record of cross-functional collaboration.
- Equity
- Benefits
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
C++, Python, Go, distributed systems, data management, large-scale orchestration frameworks, high-performance storage, compute scheduling, infrastructure automation, production-quality software systems
Soft skills
problem-solving, ownership mindset, collaboration, communication