Salary
💰 $250,000 - $380,000 per year
Tech Stack
Distributed Systems
About the role
- Design and maintain standardized dataset APIs, including for multimodal datasets too large to fit in memory
- Build proactive testing and scale-validation pipelines for dataset loading at GPU scale
- Integrate datasets into training and inference pipelines, collaborating with multimodal researchers and infra teams
- Document and maintain dataset interfaces for discoverability and consistent adoption
- Establish safeguards and validation systems to ensure reproducibility of standardized datasets
- Debug and resolve performance bottlenecks in distributed dataset loading (e.g., stragglers)
- Provide visualization and inspection tools to surface errors, bugs, or bottlenecks
- Work on LLM training and inference infrastructure to support massive-scale GPU/accelerator fleets
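The first responsibility above, a dataset API for data too large to fit in memory, is commonly handled by streaming records shard by shard rather than materializing the whole dataset. A minimal sketch of that pattern is below; the names (`ShardedDataset`, `read_shard`) and the in-memory "shards" are illustrative assumptions, not anything specified in this posting.

```python
from typing import Iterator

def read_shard(shard: list[dict]) -> Iterator[dict]:
    # Stand-in for lazily reading records from one on-disk shard
    # (e.g. a Parquet or WebDataset file in a real pipeline).
    yield from shard

class ShardedDataset:
    """Streams records shard by shard, so only one shard is live at a time."""

    def __init__(self, shards: list[list[dict]]):
        self.shards = shards

    def __iter__(self) -> Iterator[dict]:
        for shard in self.shards:
            # Each shard is opened, iterated, and released before the next,
            # keeping peak memory bounded by a single shard.
            yield from read_shard(shard)

# Usage: two tiny in-memory "shards" stand in for files on disk.
ds = ShardedDataset([[{"id": 0}, {"id": 1}], [{"id": 2}]])
print([r["id"] for r in ds])
```

The same iterator shape is what lets a data loader interleave shards across workers and validate throughput at GPU scale without ever holding the full dataset in RAM.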
Requirements
- Strong engineering fundamentals with experience in distributed systems, data pipelines, or infrastructure
- Experience building APIs, modular code, and scalable abstractions with attention to UX
- Comfortable debugging bottlenecks across large fleets of machines
- Collaborative, humble, and able to own foundational ML infrastructure
- Bonus: background in applied mathematics, probability, or the theory of distributed data systems
- Bonus: experience with GPU-scale distributed systems or dataset scaling for real-time data