Decagon

Staff Software Engineer, ML Infrastructure

Full-time

Location Type: Hybrid

Location: San Francisco, California, United States

Salary

💰 $300,000 - $430,000 per year

About the role

  • Design and build distributed training platforms for LLM and multimodal fine-tuning and post-training at scale
  • Implement and integrate state-of-the-art training algorithms into production pipelines
  • Own inference architecture and multi-provider routing, including failover and optimization
  • Research and implement inference optimizations including quantization, speculative decoding, and batching strategies
  • Lead initiatives to improve latency and cost efficiency across the training and serving stack
  • Build evaluation and experimentation infrastructure that enables rapid, reliable iteration
  • Drive technical direction, mentor engineers, and establish best practices for ML infrastructure

Requirements

  • 8+ years building ML infrastructure or production systems at scale
  • Deep experience with distributed training: multi-node GPU clusters, fault tolerance, and optimization
  • Strong understanding of LLM inference: latency optimization, provider tradeoffs, and serving architecture
  • Proficiency in Python and modern ML frameworks (PyTorch, JAX, or TensorFlow)
  • Proven track record leading complex, multi-quarter technical projects

Benefits

  • Medical, dental, and vision benefits
  • Take-what-you-need vacation policy
  • Daily lunches, dinners, and snacks in the office to keep you at your best

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

distributed training, multi-node GPU clusters, latency optimization, inference architecture, quantization, speculative decoding, batching strategies, Python, PyTorch, TensorFlow
Soft Skills

leadership, mentoring, technical direction, establishing best practices