Full-stack Engineer

• Drive down wall-clock time to convergence by profiling and eliminating bottlenecks across the foundation model training stack, from data pipelines to GPU kernels
• Design, build, and optimize distributed training systems (PyTorch) for multi-node GPU clusters, ensuring scalability, robustness, and high utilization
• Implement efficient low-level code (CUDA, cuDNN, Triton, custom kernels) and integrate it seamlessly into high-level training frameworks
• Optimize workloads for hardware efficiency: CPU/GPU compute balance, memory management, data throughput, and networking
• Develop monitoring and debugging tools for large-scale runs, enabling rapid diagnosis of performance regressions and failures

Staff Software Engineer, Training

Software Engineer, Distributed Data Systems

Principal Software Engineer – CSP Engagements

Staff Compute Architect, HPC

Senior Software Engineer, Managed Orchestration, Kubernetes

Staff Software Engineer, Modern Structured Storage