Senior Software Engineer, DGX Cloud AI Infrastructure
NVIDIA
Senior Software Engineer at NVIDIA leading the optimization of large-scale AI workloads across GPU platforms. Focused on benchmarking, performance tuning, and infrastructure resilience with a team of experts.
Posted 6/4/2026 • Full-time • Remote • California, Oregon, Texas, Washington • 🇺🇸 United States • Senior • 💰 $184,000 - $356,500 per year
Tech Stack
Tools & technologies
- Distributed Systems
- Node.js
- Python
- PyTorch
About the role
Key responsibilities & impact
- Lead bring-up, validation, and debugging of large-scale AI clusters, infrastructure, and end-to-end workloads, setting the standard for how the team operates.
- Bring up, tune, and benchmark AI pre-training, post-training, and inference workloads using PyTorch, NeMo / Megatron, TensorRT-LLM, and adjacent NVIDIA AI software stacks.
- Profile and optimize end-to-end workload performance across compute, memory, networking, and communication layers using tools such as Nsight Systems, NCCL tests, and custom microbenchmarks.
- Analyze scaling efficiency for distributed LLM workloads using data, tensor, pipeline, and expert parallelism across modern GPU clusters, and translate findings into concrete tuning guidance.
- Own root-cause analysis of complex failures, such as hangs, performance regressions, and topology sensitivity, in large distributed environments.
- Define and build the resilience and failure-attribution stack: detecting, triaging, and attributing node, fabric, and workload failures across the cluster at scale.
- Build repeatable benchmark suites, automation, acceptance criteria, and qualification workflows on new platforms.
- Tune runtime settings, communication parameters, and deployment configurations in close partnership with framework, systems, and platform teams.
- Deliver actionable, data-driven recommendations based on profiling, benchmark results, and cluster characterization.
- Mentor engineers, drive technical standards, and act as a force multiplier across the broader performance and infrastructure organization.
Requirements
What you’ll need
- Bachelor’s or Master’s in Computer Science or a related technical field (or equivalent experience).
- 8+ years of experience developing software infrastructure for large-scale AI or HPC systems, including a track record of technical leadership.
- Expertise debugging and triaging AI applications across the full stack — from the application layer down to the hardware.
- Deep hands-on experience with NCCL, CUDA-aware distributed execution, and debugging multi-GPU and multi-node workloads at scale.
- Proven track record of architecting, debugging, and scaling large-scale distributed systems.
- Expert-level Python and C/C++ programming skills.
- Experience operating workloads in scheduled, containerized cluster environments.
- Excellent analytical, debugging, and communication skills, with the ability to influence across teams.
Benefits
Comp & perks
- equity
- benefits
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
- Python
- C/C++
- NCCL
- CUDA
- PyTorch
- NeMo
- Megatron
- TensorRT-LLM
- debugging
- performance optimization
Soft Skills
- analytical skills
- communication skills
- technical leadership
- mentoring
- influencing