
Machine Learning Engineer – Training Optimization
Featherless AI
full-time
Location Type: Remote
Location: Anywhere in the World
About the role
- Optimize large-scale model training pipelines (throughput, convergence, stability, and cost)
- Improve distributed training strategies (data, model, and pipeline parallelism)
- Tune optimizers, schedulers, batch sizing, and precision (bf16 / fp16 / fp8)
- Reduce training time and compute cost via profiling, bottleneck analysis, and systems-level improvements
- Collaborate with researchers on architecture-aware training strategies
- Build and maintain robust training infrastructure (checkpointing, fault tolerance, reproducibility)
- Evaluate and integrate new training techniques (e.g. gradient checkpointing, ZeRO, FSDP, custom kernels)
- Own training performance metrics and continuously push them forward
Requirements
- Strong experience training large neural networks (LLMs or similarly large models)
- Hands-on experience with training optimization (not just model usage)
- Solid understanding of:
  - Backpropagation, optimization algorithms, and training dynamics
  - Distributed systems for ML training
- Experience with PyTorch (required)
- Comfort working close to hardware (GPUs, memory, networking constraints)
- Ability to move fluidly between research ideas and production-ready code
Nice to Have
- Experience with large-scale distributed training (multi-node, multi-GPU)
- Familiarity with DeepSpeed, FSDP, Megatron, or custom training stacks
- Experience optimizing training on AMD or NVIDIA GPUs
- Contributions to open-source ML infrastructure or research codebases
- Exposure to non-Transformer architectures (RNNs, hybrid models, etc.)
Benefits
- Competitive compensation + meaningful equity
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
large neural networks, training optimization, backpropagation, optimization algorithms, training dynamics, distributed systems, PyTorch, DeepSpeed, FSDP, Megatron
Soft Skills
collaboration, problem-solving, adaptability, communication