Featherless AI

Machine Learning Engineer – Training Optimization

Full-time

Location Type: Remote

Location: Anywhere in the World

About the role

  • Optimize large-scale model training pipelines (throughput, convergence, stability, and cost)
  • Improve distributed training strategies (data, model, and pipeline parallelism)
  • Tune optimizers, schedulers, batch sizing, and precision (bf16 / fp16 / fp8)
  • Reduce training time and compute cost via profiling, bottleneck analysis, and systems-level improvements
  • Collaborate with researchers on architecture-aware training strategies
  • Build and maintain robust training infrastructure (checkpointing, fault tolerance, reproducibility)
  • Evaluate and integrate new training techniques (e.g. gradient checkpointing, ZeRO, FSDP, custom kernels)
  • Own training performance metrics and continuously push them forward
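Several of the items above (precision tuning, gradient checkpointing) come down to a few lines of PyTorch in practice. A minimal sketch of a bf16 autocast training step with checkpointed blocks, assuming CPU execution; the model and all names are illustrative, not part of this posting:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Illustrative toy model: checkpointed blocks recompute their
# activations during backward, trading compute for activation memory.
class TinyMLP(nn.Module):
    def __init__(self, dim: int = 32):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.head = nn.Linear(dim, 1)

    def forward(self, x):
        x = checkpoint(self.block1, x, use_reentrant=False)
        x = checkpoint(self.block2, x, use_reentrant=False)
        return self.head(x)

torch.manual_seed(0)
model = TinyMLP()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 32), torch.randn(8, 1)

# Mixed precision: matmul-heavy ops run in bfloat16 under autocast,
# while the weights and the optimizer step stay in fp32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()
```

On GPU the same pattern uses `device_type="cuda"`; bf16, unlike fp16, generally needs no loss scaling because it keeps fp32's exponent range.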

Requirements

  • Strong experience training large neural networks (LLMs or similarly large models)
  • Hands-on experience with training optimization (not just model usage)
  • Solid understanding of:
    - Backpropagation, optimization algorithms, and training dynamics
    - Distributed systems for ML training
  • Experience with PyTorch (required)
  • Comfort working close to hardware (GPUs, memory, networking constraints)
  • Ability to move fluidly between research ideas and production-ready code

Nice to Have
  • Experience with large-scale distributed training (multi-node, multi-GPU)
  • Familiarity with DeepSpeed, FSDP, Megatron, or custom training stacks
  • Experience optimizing training on AMD or NVIDIA GPUs
  • Contributions to open-source ML infrastructure or research codebases
  • Exposure to non-Transformer architectures (RNNs, hybrid models, etc.)
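Stacks like DeepSpeed (ZeRO), FSDP, and Megatron all build on the same data-parallel primitive: averaging gradients across replicas, as an all-reduce would. A single-process sketch of that averaging step, with all names illustrative:

```python
import copy
import torch
import torch.nn as nn

# Two replicas of the same model, each seeing a different micro-batch.
torch.manual_seed(0)
model = nn.Linear(4, 1)
replicas = [model, copy.deepcopy(model)]
batches = [torch.randn(2, 4), torch.randn(2, 4)]
targets = [torch.randn(2, 1), torch.randn(2, 1)]

# Each replica computes gradients on its own micro-batch.
for m, x, y in zip(replicas, batches, targets):
    nn.functional.mse_loss(m(x), y).backward()

# Simulated all-reduce: average each parameter's gradient across
# replicas, leaving every replica with identical gradients.
for params in zip(*(m.parameters() for m in replicas)):
    avg = torch.stack([p.grad for p in params]).mean(dim=0)
    for p in params:
        p.grad = avg.clone()
```

In real multi-node training this step is `torch.distributed.all_reduce` over NCCL; ZeRO and FSDP additionally shard the parameters, gradients, and optimizer state across ranks rather than replicating them.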

Benefits
  • Competitive compensation + meaningful equity

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
large neural networks, training optimization, backpropagation, optimization algorithms, training dynamics, distributed systems, PyTorch, DeepSpeed, FSDP, Megatron
Soft Skills
collaboration, problem-solving, adaptability, communication