Baseten

Senior Software Engineer, Model Training

full-time

Location: California, New York • 🇺🇸 United States

Salary

💰 $200,000 - $275,000 per year

Job Level

Senior

Tech Stack

AWS, Azure, Cloud, Distributed Systems, Google Cloud Platform, Kubernetes, PyTorch, Ray, Spark

About the role

  • Design, build, and maintain distributed training infrastructure for large-scale foundation models
  • Implement scalable pipelines for fine-tuning and training across heterogeneous GPU/accelerator clusters
  • Optimize training performance through techniques such as FSDP, DDP, ZeRO, and mixed-precision training
  • Contribute to frameworks and tooling that make training workflows efficient, reproducible, and developer-friendly
  • Collaborate with cross-functional teams (Product, Forward Deployed Engineering, Inference Infra) to ensure training systems meet real-world requirements
  • Research and apply emerging techniques in training efficiency and model adaptation, and productionize them in the Baseten platform
  • Participate in code reviews, system design discussions, and technical deep dives to maintain a high engineering bar

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience
  • 4+ years of experience in software engineering with a focus on ML infrastructure, distributed systems, or ML platform engineering
  • Hands-on expertise in distributed training frameworks (FSDP, DDP, ZeRO, or similar) and ML frameworks (PyTorch, Transformers, Lightning, TRL)
  • Strong understanding of GPU/accelerator performance optimization and scaling techniques
  • Experience designing and operating large-scale systems in production (cloud-native preferred)
  • Excellent problem-solving and communication skills, with the ability to work across infrastructure and ML boundaries
  • Experience building APIs, SDKs, or developer tools for ML workflows (nice to have)
  • Familiarity with cluster management and scheduling (Kubernetes, Ray, Slurm, etc.) (nice to have)
  • Knowledge of parameter-efficient fine-tuning methods (LoRA, QLoRA) and evaluation pipelines (nice to have)
  • Contributions to open-source distributed training or ML infra projects (nice to have)
  • Experience with cloud environments (AWS, GCP, Azure) and container orchestration (nice to have)
Safran

Principal Software Engineer, Connectivity

Lead · full-time · $165k–$187k / year · California · 🇺🇸 United States
Posted: 46 minutes ago · Source: apply.workable.com
JavaScript, Python
PrePass

Software Engineer

Mid · Senior · full-time · Arizona · 🇺🇸 United States
Posted: 51 minutes ago · Source: apply.workable.com
Azure, Cloud, .NET, SQL
Evertune AI

Senior Software Engineer, Full Stack

Senior · full-time · $140k–$200k / year · New York · 🇺🇸 United States
Posted: 1 hour ago · Source: jobs.ashbyhq.com
Cloud, Google Cloud Platform, JavaScript, Python, Vue.js
AirGarage

Senior Embedded Software Engineer

Senior · full-time · $180k–$210k / year · 🇺🇸 United States
Posted: 2 hours ago · Source: jobs.ashbyhq.com
Docker, Grafana, Kafka, Linux, Prometheus, Python, Redis
Newfire Global Partners

Staff Software Engineer

Lead · full-time · 🇺🇸 United States
Posted: 3 hours ago · Source: newfireglobal.pinpointhq.com
Android, iOS, JavaScript, Node.js, React, React Native