Featherless AI

Machine Learning Engineer – Training Optimization

Full-time

Location Type: Remote

Location: Anywhere in the World

About the role

  • Optimize large-scale model training pipelines (throughput, convergence, stability, and cost)
  • Improve distributed training strategies (data, model, and pipeline parallelism)
  • Tune optimizers, schedulers, batch sizing, and precision (bf16 / fp16 / fp8)
  • Reduce training time and compute cost via profiling, bottleneck analysis, and systems-level improvements
  • Collaborate with researchers on architecture-aware training strategies
  • Build and maintain robust training infrastructure (checkpointing, fault tolerance, reproducibility)
  • Evaluate and integrate new training techniques (e.g. gradient checkpointing, ZeRO, FSDP, custom kernels)
  • Own training performance metrics and continuously push them forward
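Several of the items above (precision tuning, gradient checkpointing) come down to a few lines of PyTorch in practice. A minimal sketch of a bf16 autocast training step with checkpointed blocks, assuming CPU execution; the model and all names are illustrative, not part of this posting:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Illustrative toy model: checkpointed blocks recompute their
# activations during backward, trading compute for activation memory.
class TinyMLP(nn.Module):
    def __init__(self, dim: int = 32):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.head = nn.Linear(dim, 1)

    def forward(self, x):
        x = checkpoint(self.block1, x, use_reentrant=False)
        x = checkpoint(self.block2, x, use_reentrant=False)
        return self.head(x)

torch.manual_seed(0)
model = TinyMLP()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 32), torch.randn(8, 1)

# Mixed precision: matmul-heavy ops run in bfloat16 under autocast,
# while the weights and the optimizer step stay in fp32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()
```

On GPU the same pattern uses `device_type="cuda"`; bf16, unlike fp16, generally needs no loss scaling because it keeps fp32's exponent range.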

Requirements

  • Strong experience training large neural networks (LLMs or similarly large models)
  • Hands-on experience with training optimization (not just model usage)
  • Solid understanding of:
    - Backpropagation, optimization algorithms, and training dynamics
    - Distributed systems for ML training
  • Experience with PyTorch (required)
  • Comfort working close to hardware (GPUs, memory, networking constraints)
  • Ability to move fluidly between research ideas and production-ready code

Nice to Have
  • Experience with large-scale distributed training (multi-node, multi-GPU)
  • Familiarity with DeepSpeed, FSDP, Megatron, or custom training stacks
  • Experience optimizing training on AMD or NVIDIA GPUs
  • Contributions to open-source ML infrastructure or research codebases
  • Exposure to non-Transformer architectures (RNNs, hybrid models, etc.)
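Stacks like DeepSpeed (ZeRO), FSDP, and Megatron all build on the same data-parallel primitive: averaging gradients across replicas, as an all-reduce would. A single-process sketch of that averaging step, with all names illustrative:

```python
import copy
import torch
import torch.nn as nn

# Two replicas of the same model, each seeing a different micro-batch.
torch.manual_seed(0)
model = nn.Linear(4, 1)
replicas = [model, copy.deepcopy(model)]
batches = [torch.randn(2, 4), torch.randn(2, 4)]
targets = [torch.randn(2, 1), torch.randn(2, 1)]

# Each replica computes gradients on its own micro-batch.
for m, x, y in zip(replicas, batches, targets):
    nn.functional.mse_loss(m(x), y).backward()

# Simulated all-reduce: average each parameter's gradient across
# replicas, leaving every replica with identical gradients.
for params in zip(*(m.parameters() for m in replicas)):
    avg = torch.stack([p.grad for p in params]).mean(dim=0)
    for p in params:
        p.grad = avg.clone()
```

In real multi-node training this step is `torch.distributed.all_reduce` over NCCL; ZeRO and FSDP additionally shard the parameters, gradients, and optimizer state across ranks rather than replicating them.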

Benefits
  • Competitive compensation + meaningful equity

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
large neural networks, training optimization, backpropagation, optimization algorithms, training dynamics, distributed systems, PyTorch, DeepSpeed, FSDP, Megatron
Soft Skills
collaboration, problem-solving, adaptability, communication