
Senior Software Engineer, Model Training

Baseten

Full-time

Location: 🇺🇸 United States • California, New York


Salary

💰 $200,000 - $275,000 per year

Job Level

Senior

Tech Stack

AWS • Azure • Cloud • Distributed Systems • Google Cloud Platform • Kubernetes • PyTorch • Ray • Spark

About the role

  • Design, build, and maintain distributed training infrastructure for large-scale foundation models
  • Implement scalable pipelines for fine-tuning and training across heterogeneous GPU/accelerator clusters
  • Optimize training performance through techniques like FSDP, DDP, ZeRO, and mixed precision training (a brief sketch follows this list)
  • Contribute to frameworks and tooling that make training workflows efficient, reproducible, and developer-friendly
  • Collaborate with cross-functional teams (Product, Forward Deployed Engineering, Inference Infra) to ensure training systems meet real-world requirements
  • Research and apply emerging techniques in training efficiency and model adaptation, and productionize them in the Baseten platform
  • Participate in code reviews, system design discussions, and technical deep dives to maintain a high engineering bar
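
To make the sharding and mixed-precision bullet concrete, here is a minimal sketch of wrapping a model in PyTorch FSDP with bf16 mixed precision. The toy model, sizes, and hyperparameters are illustrative assumptions, not part of this posting or Baseten's actual stack.

```python
# Minimal sketch: FSDP sharding with bf16 mixed precision (PyTorch >= 2.0).
# Launch under torchrun so torch.distributed picks up RANK/WORLD_SIZE.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Toy stand-in for a real transformer; sizes are arbitrary.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).cuda()

# Run compute and gradient reduction in bf16 to cut memory and bandwidth.
bf16 = MixedPrecision(param_dtype=torch.bfloat16,
                      reduce_dtype=torch.bfloat16,
                      buffer_dtype=torch.bfloat16)

model = FSDP(model, mixed_precision=bf16)  # shards params, grads, optimizer state

optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 4096, device="cuda")
loss = model(x).float().pow(2).mean()  # dummy loss for illustration
loss.backward()
optim.step()
```

Run with, e.g., `torchrun --nproc_per_node=8 train.py`; DDP and ZeRO-style sharding follow the same launch pattern with different wrappers.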

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or related field, or equivalent experience
  • 4+ years of experience in software engineering with a focus on ML infrastructure, distributed systems, or ML platform engineering
  • Hands-on expertise in distributed training frameworks (FSDP, DDP, ZeRO, or similar) and ML frameworks (PyTorch, Transformers, Lightning, TRL)
  • Strong understanding of GPU/accelerator performance optimization and scaling techniques
  • Experience designing and operating large-scale systems in production (cloud-native preferred)
  • Excellent problem-solving and communication skills, with the ability to work across infrastructure and ML boundaries
  • Experience building APIs, SDKs, or developer tools for ML workflows (nice to have)
  • Familiarity with cluster management and scheduling (Kubernetes, Ray, Slurm, etc.) (nice to have)
  • Knowledge of parameter-efficient fine-tuning methods (LoRA, QLoRA) and evaluation pipelines (nice to have; see the sketch after this list)
  • Contributions to open-source distributed training or ML infra projects (nice to have)
  • Experience with cloud environments (AWS, GCP, Azure) and container orchestration (nice to have)
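
For a sense of the parameter-efficient fine-tuning methods mentioned above, here is a minimal LoRA sketch using the Hugging Face peft and transformers libraries; the base model and hyperparameters are illustrative placeholders, not a Baseten configuration.

```python
# Minimal sketch: attaching LoRA adapters to a causal LM with peft.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model

config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling applied to the update
    target_modules=["c_attn"],  # fused attention projection in GPT-2
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the small adapter weights train
```

Freezing the base weights and training only a low-rank update keeps trainable parameters to a fraction of a percent of the full model, which is what makes fine-tuning tractable on modest GPU clusters.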