Salary
💰 $250,000 - $300,000 per year
Tech Stack
AWS, Azure, Cloud, Distributed Systems, Google Cloud Platform, Kubernetes, Ray, Spark
About the role
- Lead, mentor, and grow a team of engineers building Baseten’s training infrastructure
- Define and drive the technical strategy for large-scale training systems, with a focus on scalability, reliability, and efficiency
- Architect and optimize distributed training pipelines across heterogeneous GPU/accelerator environments
- Balance hands-on contributions (system design, code reviews, prototyping) with people leadership and career development
- Establish best practices for training workflows, distributed systems design, and high-performance model evaluation
- Collaborate with Product and Platform Engineering to translate customer and internal needs into reusable infrastructure and APIs
- Develop processes that ensure consistent, reliable, and on-time delivery of high-quality systems
- Stay ahead of advancements in training efficiency (FSDP, ZeRO, parameter-efficient training, hardware-aware scheduling) and bring them into production
Requirements
- Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience
- 5+ years of experience in ML infrastructure, distributed systems, or ML platform engineering, including 2+ years in a tech lead or manager role
- Strong expertise in distributed training frameworks and orchestration (FSDP, DDP, ZeRO, Ray, Kubernetes, Slurm, or similar)
- Hands-on experience building or scaling training infrastructure for LLMs or other foundation models
- Deep understanding of GPU/accelerator hardware utilization, mixed precision training, and scaling efficiency
- Proven ability to lead and mentor technical teams while delivering complex infrastructure projects
- Excellent communication skills, with the ability to bridge technical depth and business needs
- Nice to have: Experience with multi-tenant, production-grade ML platforms
- Nice to have: Familiarity with cluster management, GPU scheduling, or elastic resource scaling
- Nice to have: Knowledge of advanced model adaptation techniques (LoRA, QLoRA, RLHF, DPO)
- Nice to have: Contributions to open-source distributed training or ML infrastructure projects
- Nice to have: Experience building developer-friendly APIs or SDKs for ML workflows
- Nice to have: Cloud-native infrastructure experience (AWS, GCP, Azure, containerization, orchestration)