NVIDIA

Senior Software Engineer, DGX Cloud AI Infrastructure

Senior Software Engineer at NVIDIA leading the optimization of large-scale AI workloads across GPU platforms. Focused on benchmarking, performance tuning, and infrastructure resilience with a team of experts.

Posted 6/4/2026 • Full-time • Remote • California, Oregon, Texas, Washington • 🇺🇸 United States • Senior • 💰 $184,000 - $356,500 per year

Tech Stack

Tools & technologies
Distributed Systems, Node.js, Python, PyTorch

About the role

Key responsibilities & impact
  • Lead bring-up, validation, and debugging of large-scale AI clusters, infrastructure, and end-to-end workloads, setting the standard for how the team operates.
  • Bring up, tune, and benchmark AI pre-training, post-training, and inference workloads using PyTorch, NeMo / Megatron, TensorRT-LLM, and adjacent NVIDIA AI software stacks.
  • Profile and optimize end-to-end workload performance across compute, memory, networking, and communication layers using tools such as Nsight Systems, NCCL tests, and custom microbenchmarks.
  • Analyze scaling efficiency for distributed LLM workloads using data, tensor, pipeline, and expert parallelism across modern GPU clusters, and translate findings into concrete tuning guidance.
  • Own root-cause analysis of complex failures, such as hangs, performance regressions, and topology sensitivity, in large distributed environments.
  • Define and build the resilience and failure-attribution stack: detecting, triaging, and attributing node, fabric, and workload failures across the cluster at scale.
  • Build repeatable benchmark suites, automation, acceptance criteria, and qualification workflows on new platforms.
  • Tune runtime settings, communication parameters, and deployment configurations in close partnership with framework, systems, and platform teams.
  • Deliver actionable, data-driven recommendations based on profiling, benchmark results, and cluster characterization.
  • Mentor engineers, drive technical standards, and act as a force multiplier across the broader performance and infrastructure organization.

Requirements

What you’ll need
  • Bachelor’s or Master’s in Computer Science or a related technical field (or equivalent experience).
  • 8+ years of experience developing software infrastructure for large-scale AI or HPC systems, including a track record of technical leadership.
  • Expertise in debugging and triaging AI applications across the full stack, from the application layer down to the hardware.
  • Deep hands-on experience with NCCL, CUDA-aware distributed execution, and debugging multi-GPU and multi-node workloads at scale.
  • Proven track record of architecting, debugging, and scaling large-scale distributed systems.
  • Expert-level Python and C/C++ programming skills.
  • Experience operating workloads in scheduled, containerized cluster environments.
  • Excellent analytical, debugging, and communication skills, with the ability to influence across teams.

Benefits

Comp & perks
  • equity
  • benefits

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Python, C/C++, NCCL, CUDA, PyTorch, NeMo, Megatron, TensorRT-LLM, debugging, performance optimization
Soft Skills
analytical skills, communication skills, technical leadership, mentoring, influencing