
Staff Software Engineer, ML Infrastructure
Decagon
Full-time
Location Type: Hybrid
Location: San Francisco • California • United States
Salary
💰 $300,000 - $430,000 per year
About the role
- Design and build distributed training platforms for LLM and multimodal fine-tuning and post-training at scale
- Implement and integrate state-of-the-art training algorithms into production pipelines
- Own inference architecture and multi-provider routing, including failover and optimization
- Research and implement inference optimizations including quantization, speculative decoding, and batching strategies
- Lead initiatives to improve latency and cost efficiency across the training and serving stack
- Build evaluation and experimentation infrastructure that enables rapid, reliable iteration
- Drive technical direction, mentor engineers, and establish best practices for ML infrastructure
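One responsibility above is owning multi-provider inference routing with failover. As a minimal sketch of that pattern (the provider names and call interface here are illustrative assumptions, not Decagon's actual stack):

```python
class ProviderError(Exception):
    """Raised when an inference provider fails (rate limit, timeout, outage)."""


def route_with_failover(prompt, providers):
    """Try providers in priority order; return (provider_name, response).

    `providers` is a list of (name, callable) pairs. On failure, fall
    through to the next provider; raise only if every provider fails.
    """
    errors = {}
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors[name] = str(exc)  # record and try the next provider
    raise RuntimeError(f"all providers failed: {errors}")


# Toy providers for demonstration: the primary always fails,
# so the router falls back to the backup.
def flaky_provider(prompt):
    raise ProviderError("rate limited")


def backup_provider(prompt):
    return f"echo: {prompt}"


used, result = route_with_failover(
    "hi", [("primary", flaky_provider), ("backup", backup_provider)]
)
# used == "backup", result == "echo: hi"
```

A production router would add per-provider health tracking, latency-aware ordering, and retry budgets; this sketch shows only the priority-ordered failover core.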
Requirements
- 8+ years building ML infrastructure or production systems at scale
- Deep experience with distributed training: multi-node GPU clusters, fault tolerance, and optimization
- Strong understanding of LLM inference: latency optimization, provider tradeoffs, and serving architecture
- Proficiency in Python and modern ML frameworks (PyTorch, JAX, or TensorFlow)
- Proven track record leading complex, multi-quarter technical projects
Benefits
- Medical, dental, and vision benefits
- Take-what-you-need vacation policy
- Daily lunches, dinners, and snacks in the office to keep you at your best
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
distributed training, multi-node GPU clusters, latency optimization, inference architecture, quantization, speculative decoding, batching strategies, Python, PyTorch, TensorFlow
Soft Skills
leadership, mentoring, technical direction, establishing best practices