Senior Software Engineer, AI Systems

NVIDIA

full-time

Posted on: 8/19/2025

Origin: 🇨🇦 Canada

💰 CA$116,250 - CA$247,000 per year

Senior

AWS, Azure, Cloud, Distributed Systems, Docker, Google Cloud Platform, Kubernetes, Node.js, Python, PyTorch

About the role

Design and implement highly efficient distributed training systems for large-scale RL models
Optimize parallelism strategies to improve performance and scalability across hundreds or thousands of GPUs
Develop low-level systems components and algorithms to maximize throughput and minimize memory and compute bottlenecks
Productionize the training systems with fault-tolerance capabilities and uncompromising software quality
Collaborate with researchers and engineers to productionize cutting-edge model architectures and training techniques
Contribute to the design of APIs, abstractions, and UX that make it easier to scale models while maintaining usability and flexibility
Profile, debug, and tune performance at the model, system, and hardware levels
Participate in design discussions, code reviews, and technical planning to ensure the product aligns with business goals
Stay up to date with the latest advancements in large-scale model training and help translate research into practical, robust systems

Requirements

Bachelor’s, Master’s, or PhD degree in Computer Science/Engineering, Software Engineering, a related field, or equivalent experience
3+ years of experience in software development, preferably with Python and C++
Deep understanding of machine learning pipelines and workflows, distributed systems, parallel computing, and high-performance computing principles
Hands-on experience with large-scale training of deep learning models using frameworks like PyTorch, Megatron Core, or DeepSpeed
Experience optimizing compute, memory, and communication performance in large model training workflows
Familiarity with GPU programming, CUDA, NCCL, and performance profiling tools
Solid grasp of deep learning fundamentals, especially as they relate to RL and training dynamics
Ability to work closely with both research and engineering teams, translating evolving needs into technical requirements and robust code
Excellent problem-solving skills, with the ability to debug complex systems
A passion for building high-impact tools that push the boundaries of what’s possible with large-scale AI