NVIDIA

Senior Software Engineer, AI Systems

full-time

Origin: 🇨🇦 Canada

Salary

💰 CA$116,250 - CA$247,000 per year

Job Level

Senior

Tech Stack

AWS, Azure, Cloud, Distributed Systems, Docker, Google Cloud Platform, Kubernetes, Node.js, Python, PyTorch

About the role

  • Design and implement highly efficient distributed training systems for large-scale RL models
  • Optimize parallelism strategies to improve performance and scalability across hundreds or thousands of GPUs
  • Develop low-level systems components and algorithms to maximize throughput and minimize memory and compute bottlenecks
  • Productionize the training systems with fault-tolerance capabilities and uncompromising software quality
  • Collaborate with researchers and engineers to productionize cutting-edge model architectures and training techniques
  • Contribute to the design of APIs, abstractions, and UX that make it easier to scale models while maintaining usability and flexibility
  • Profile, debug, and tune performance at the model, system, and hardware levels
  • Participate in design discussions, code reviews, and technical planning to ensure the product aligns with business goals
  • Stay up to date with the latest advancements in large-scale model training and help translate research into practical, robust systems

Requirements

  • Bachelor’s, Master’s, or PhD degree in Computer Science/Engineering, Software Engineering, a related field, or equivalent experience
  • 3+ years of experience in software development, preferably with Python and C++
  • Deep understanding of machine learning pipelines and workflows, distributed systems, parallel computing, and high-performance computing principles
  • Hands-on experience with large-scale training of deep learning models using frameworks like PyTorch, Megatron Core, or DeepSpeed
  • Experience optimizing compute, memory, and communication performance in large model training workflows
  • Familiarity with GPU programming, CUDA, NCCL, and performance profiling tools
  • Solid grasp of deep learning fundamentals, especially as they relate to RL and training dynamics
  • Ability to work closely with both research and engineering teams, translating evolving needs into technical requirements and robust code
  • Excellent problem-solving skills, with the ability to debug complex systems
  • A passion for building high-impact tools that push the boundaries of what’s possible with large-scale AI