Salary
💰 $180,000 - $300,000 per year
Tech Stack
Linux, Node.js, Python, PyTorch
About the role
- Build systems that let every team and every robot go faster: training more often, evaluating more reliably, and deploying better models to our growing fleet
- Transform prototypes into production-scale infrastructure for learning and inference, enabling larger training runs and maximizing edge compute utilization
- Exercise high agency and ownership over scaling capabilities in distributed training and/or inference
- Ensure that compute is never the bottleneck, so that available compute always outpaces the data we can feed it
- Enable large-scale (1,000+ GPU) training on billions of frames of robot data, including fault tolerance, distributed ops, and experiment management
- Optimize high-throughput, datacenter-scale distributed inference for world models, including building the world's fastest diffusion inference engine
- Improve low-latency on-device inference for robot policies with quantization, scheduling, distillation, and more (a minimal quantization sketch follows this list)
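
As a rough illustration of the on-device optimization work in the last bullet, here is a minimal post-training quantization (PTQ) sketch in PyTorch. PolicyNet and its dimensions are hypothetical stand-ins for a robot policy, and dynamic INT8 quantization is just one of the techniques the role covers; production deployments would more likely go through TensorRT or ModelOpt.

```python
# Minimal PTQ sketch: quantize a toy policy network to INT8 and
# compare per-call latency against the FP32 baseline (CPU only).
import time
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Hypothetical MLP standing in for an on-device robot policy."""
    def __init__(self, obs_dim=256, act_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, act_dim),
        )
    def forward(self, obs):
        return self.net(obs)

model = PolicyNet().eval()

# Dynamic PTQ: Linear weights are converted to INT8 ahead of time,
# activations are quantized on the fly. No calibration data needed.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

obs = torch.randn(1, 256)
with torch.inference_mode():
    for m, name in [(model, "fp32"), (quantized, "int8")]:
        start = time.perf_counter()
        for _ in range(100):
            m(obs)
        # total seconds / 100 calls, reported in milliseconds
        print(f"{name}: {(time.perf_counter() - start) * 10:.2f} ms/call")
```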
Requirements
- You must be scaling-pilled and believe that scale will enable humanoid robots to exist
- Python and/or C++ programming experience
- An intuitive understanding of training or inference scaling and what makes models run fast or slow
- Hands-on experience with distributed training (TorchTitan/Accelerate/DeepSpeed, FSDP/ZeRO, NCCL); see the FSDP sketch after this list
- Multi-node debugging and experiment management experience
- Depth in inference performance: TensorRT or similar graph compilers, batching/scheduling, and serving systems
- Real familiarity with quantization (PTQ, QAT; calibration strategies; INT8/FP8; libraries such as TensorRT ModelOpt, bitsandbytes, or equivalent)
- Experience writing or tuning CUDA/Triton kernels and leveraging vectorization, tensor cores, and memory hierarchy
- Familiarity with Linux, PyTorch, and Triton/CUDA
- A degree in Computer Science or a related field is a plus
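
As a rough illustration of the distributed-training experience described above, here is a minimal FSDP (fully sharded data parallel) sketch. The module, dimensions, and loss are hypothetical; it assumes a single GPU node launched with torchrun, e.g. `torchrun --nproc_per_node=8 fsdp_sketch.py`.

```python
# Minimal FSDP sketch: shard a model's parameters, gradients, and
# optimizer state across ranks (ZeRO-3-style) and take one step.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")  # NCCL backend for GPU collectives
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    # Hypothetical model; FSDP gathers shards for each forward/backward
    # and frees them afterward, cutting per-GPU memory.
    model = nn.Transformer(d_model=1024, num_encoder_layers=12).cuda()
    model = FSDP(model)
    optim = torch.optim.AdamW(model.parameters(), lr=3e-4)

    # Dummy batch: (seq_len, batch, d_model) in PyTorch's default layout.
    src = torch.randn(64, 8, 1024, device="cuda")
    tgt = torch.randn(64, 8, 1024, device="cuda")

    loss = model(src, tgt).square().mean()  # stand-in loss
    loss.backward()   # gradients are reduce-scattered across ranks
    optim.step()
    if dist.get_rank() == 0:
        print(f"step done, loss={loss.item():.4f}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```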