Baseten

Tech Lead/Manager – Model Training

full-time

Location: California • 🇺🇸 United States

Salary

💰 $250,000 - $300,000 per year

Job Level

Senior

Tech Stack

AWS, Azure, Cloud, Distributed Systems, Google Cloud Platform, Kubernetes, Ray, Spark

About the role

  • Lead, mentor, and grow a team of engineers building Baseten’s training infrastructure
  • Define and drive the technical strategy for large-scale training systems, with a focus on scalability, reliability, and efficiency
  • Architect and optimize distributed training pipelines across heterogeneous GPU/accelerator environments
  • Balance hands-on contributions (system design, code reviews, prototyping) with people leadership and career development
  • Establish best practices for training workflows, distributed systems design, and high-performance model evaluation
  • Collaborate with Product and Platform Engineering to translate customer and internal needs into reusable infrastructure and APIs
  • Develop processes that ensure consistent, reliable, and on-time delivery of high-quality systems
  • Track advances in training efficiency (FSDP, ZeRO, parameter-efficient training, hardware-aware scheduling) and bring them into production

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or related field, or equivalent experience
  • 5+ years of experience in ML infrastructure, distributed systems, or ML platform engineering, including 2+ years in a tech lead or manager role
  • Strong expertise in distributed training frameworks and orchestration (FSDP, DDP, ZeRO, Ray, Kubernetes, Slurm, or similar)
  • Hands-on experience building or scaling training infrastructure for LLMs or other foundation models
  • Deep understanding of GPU/accelerator hardware utilization, mixed precision training, and scaling efficiency
  • Proven ability to lead and mentor technical teams while delivering complex infrastructure projects
  • Excellent communication skills, with the ability to bridge technical depth and business needs
  • Nice to have: Experience with multi-tenant, production-grade ML platforms
  • Nice to have: Familiarity with cluster management, GPU scheduling, or elastic resource scaling
  • Nice to have: Knowledge of advanced model adaptation techniques (LoRA, QLoRA, RLHF, DPO)
  • Nice to have: Contributions to open-source distributed training or ML infrastructure projects
  • Nice to have: Experience building developer-friendly APIs or SDKs for ML workflows
  • Nice to have: Cloud-native infrastructure experience (AWS, GCP, Azure, containerization, orchestration)