Salary
💰 $200,000 - $275,000 per year
Tech Stack
AWS · Azure · Cloud · Distributed Systems · Google Cloud Platform · Kubernetes · PyTorch · Ray · Spark
About the role
- Design, build, and maintain distributed training infrastructure for large-scale foundation models
- Implement scalable pipelines for fine-tuning and training across heterogeneous GPU/accelerator clusters
- Optimize training performance through techniques like FSDP, DDP, ZeRO, and mixed precision training
- Contribute to frameworks and tooling that make training workflows efficient, reproducible, and developer-friendly
- Collaborate with cross-functional teams (Product, Forward Deployed Engineering, Inference Infra) to ensure training systems meet real-world requirements
- Research and apply emerging techniques in training efficiency and model adaptation, and productionize them in the Baseten platform
- Participate in code reviews, system design discussions, and technical deep dives to maintain a high engineering bar
Requirements
- Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience
- 4+ years of experience in software engineering with a focus on ML infrastructure, distributed systems, or ML platform engineering
- Hands-on expertise in distributed training frameworks (FSDP, DDP, ZeRO, or similar) and ML frameworks (PyTorch, Transformers, Lightning, TRL)
- Strong understanding of GPU/accelerator performance optimization and scaling techniques
- Experience designing and operating large-scale systems in production (cloud-native preferred)
- Excellent problem-solving and communication skills, with the ability to work across infrastructure and ML boundaries
- Experience building APIs, SDKs, or developer tools for ML workflows (nice to have)
- Familiarity with cluster management and scheduling (Kubernetes, Ray, Slurm, etc.) (nice to have)
- Knowledge of parameter-efficient fine-tuning methods (LoRA, QLoRA) and evaluation pipelines (nice to have)
- Contributions to open-source distributed training or ML infra projects (nice to have)
- Experience with cloud environments (AWS, GCP, Azure) and container orchestration (nice to have)