Salary
💰 CA$116,250 - CA$247,000 per year
Tech Stack
AWS, Azure, Cloud, Distributed Systems, Docker, Google Cloud Platform, Kubernetes, Node.js, Python, PyTorch
About the role
- Design and implement highly efficient distributed training systems for large-scale RL models
- Optimize parallelism strategies to improve performance and scalability across hundreds or thousands of GPUs
- Develop low-level systems components and algorithms to maximize throughput and minimize memory and compute bottlenecks
- Productionize the training systems with fault tolerance and uncompromising software quality
- Collaborate with researchers and engineers to productionize cutting-edge model architectures and training techniques
- Contribute to the design of APIs, abstractions, and UX that make it easier to scale models while maintaining usability and flexibility
- Profile, debug, and tune performance at the model, system, and hardware levels
- Participate in design discussions, code reviews, and technical planning to ensure the product aligns with business goals
- Stay up to date with the latest advancements in large-scale model training and help translate research into practical, robust systems
Requirements
- Bachelor’s, Master’s, or PhD degree in Computer Science/Engineering, Software Engineering, a related field, or equivalent experience
- 3+ years of experience in software development, preferably with Python and C++
- Deep understanding of machine learning pipelines and workflows, distributed systems, parallel computing, and high-performance computing principles
- Hands-on experience with large-scale training of deep learning models using frameworks like PyTorch, Megatron Core, or DeepSpeed
- Experience optimizing compute, memory, and communication performance in large model training workflows
- Familiarity with GPU programming, CUDA, NCCL, and performance profiling tools
- Solid grasp of deep learning fundamentals, especially as they relate to RL and training dynamics
- Ability to work closely with both research and engineering teams, translating evolving needs into technical requirements and robust code
- Excellent problem-solving skills, with the ability to debug complex systems
- A passion for building high-impact tools that push the boundaries of what’s possible with large-scale AI