Software Engineer, DGX Cloud AI Infrastructure
NVIDIA
Software Engineer optimizing distributed AI workloads on NVIDIA's GPU platforms, with a focus on benchmarking, analysis, and debugging for large-scale AI systems and infrastructure.
Posted 6/4/2026 • Full-time • Santa Clara • California, Oregon, Texas, Washington • 🇺🇸 United States • Mid-Level, Senior • 💰 $116,000 - $224,250 per year
Tech Stack
Tools & technologies: Distributed Systems, Node.js, Python, PyTorch
About the role
Key responsibilities & impact
- Bring up, validate, and debug large-scale AI clusters, infrastructure, and end-to-end workloads.
- Bring up, tune, and benchmark AI pre-training, post-training, and inference workloads using PyTorch, NeMo / Megatron, TensorRT-LLM, and adjacent NVIDIA AI software stacks.
- Perform root-cause analysis of failures in large distributed environments.
- Contribute to the resilience and failure-attribution tooling that detects, triages, and attributes node, fabric, and workload failures across the cluster.
- Build and maintain repeatable benchmark suites, automation, acceptance criteria, and qualification workflows on new platforms.
- Tune runtime settings, communication parameters, and deployment configurations in close partnership with framework, systems, and platform teams.
- Deliver actionable, data-driven recommendations based on profiling, benchmark results, and cluster characterization.
Requirements
What you’ll need
- Bachelor’s or Master’s in Computer Science or a related technical field (or equivalent experience).
- 3+ years of experience developing software for AI, HPC, or systems-level applications.
- Hands-on experience with multi-GPU or multi-node workloads and CUDA-aware distributed execution.
- Background in debugging and scaling distributed systems.
- Experience debugging and triaging AI applications across the full stack, from the application level toward the hardware.
- Experience operating workloads in scheduled, containerized cluster environments.
- Excellent analytical, debugging, and communication skills, and a collaborative approach across teams.
- Strong Python and C/C++ programming skills.
Benefits
Comp & perks
- equity
- benefits
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Python, C/C++, PyTorch, NeMo, Megatron, TensorRT-LLM, CUDA, AI applications, HPC, distributed systems
Soft Skills
analytical skills, debugging skills, communication skills, collaborative approach
Certifications
Bachelor’s in Computer Science, Master’s in Computer Science