Baseten

Tech Lead/Manager – Model Training

Full-time

Location: 🇺🇸 United States • California

Salary

💰 $250,000 - $300,000 per year

Job Level

Senior

Tech Stack

AWS • Azure • Cloud • Distributed Systems • Google Cloud Platform • Kubernetes • Ray • Spark

About the role

  • Lead, mentor, and grow a team of engineers building Baseten’s training infrastructure
  • Define and drive the technical strategy for large-scale training systems, with a focus on scalability, reliability, and efficiency
  • Architect and optimize distributed training pipelines across heterogeneous GPU/accelerator environments
  • Balance hands-on contributions (system design, code reviews, prototyping) with people leadership and career development
  • Establish best practices for training workflows, distributed systems design, and high-performance model evaluation
  • Collaborate with Product and Platform Engineering to translate customer and internal needs into reusable infrastructure and APIs
  • Develop processes that ensure consistent, reliable, and on-time delivery of high-quality systems
  • Stay ahead of the curve on advancements in training efficiency (FSDP, ZeRO, parameter-efficient training, hardware-aware scheduling) and bring them into production

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or related field, or equivalent experience
  • 5+ years of experience in ML infrastructure, distributed systems, or ML platform engineering, including 2+ years in a tech lead or manager role
  • Strong expertise in distributed training frameworks and orchestration (FSDP, DDP, ZeRO, Ray, Kubernetes, Slurm, or similar)
  • Hands-on experience building or scaling training infrastructure for LLMs or other foundation models
  • Deep understanding of GPU/accelerator hardware utilization, mixed precision training, and scaling efficiency
  • Proven ability to lead and mentor technical teams while delivering complex infrastructure projects
  • Excellent communication skills, with the ability to bridge technical depth and business needs
  • Nice to have: Experience with multi-tenant, production-grade ML platforms
  • Nice to have: Familiarity with cluster management, GPU scheduling, or elastic resource scaling
  • Nice to have: Knowledge of advanced model adaptation techniques (LoRA, QLoRA, RLHF, DPO)
  • Nice to have: Contributions to open-source distributed training or ML infrastructure projects
  • Nice to have: Experience building developer-friendly APIs or SDKs for ML workflows
  • Nice to have: Cloud-native infrastructure experience (AWS, GCP, Azure, containerization, orchestration)