
Senior Machine Learning Engineer, ML Training Platform
Reddit, Inc.
full-time
Location Type: Remote
Location: United States
Salary
💰 $216,700 - $303,400 per year
About the role
- Lead the building, testing, and maintenance of ML training infrastructure at Reddit.
- Play a pivotal role in designing, building, and optimizing the infrastructure and tooling required to support large-scale machine learning workflows.
- Evolve the MLE experience, from provisioning interactive GPU environments through large-scale training, supporting on-demand and self-service workflows.
- Kubernetes Automation: Write custom Kubernetes Controllers and Operators to manage the lifecycle of interactive Jupyter workspaces and long-running ML training jobs, handle auto-idling, and ensure fault tolerance.
- GPU Orchestration: Work with the underlying compute team to ensure MLEs have efficient access to training hardware resources and handle resource contention gracefully.
- Developer Experience (DevX): Treat internal MLEs as your customers. Conduct user research, reduce friction in the "Idea-to-Prototype" loop, and standardize software environments (Docker images, Python dependency management).
Requirements
- 5+ years of software engineering experience, with a focus on Platform Engineering, ML Infrastructure, or Backend Systems.
- Deep Kubernetes Expertise: You know K8s beyond just "deploying pods." You understand CRDs, Controllers, and the Operator pattern.
- Jupyter Ecosystem Knowledge: Experience customizing JupyterHub, JupyterLab extensions, or building similar interactive computing platforms.
- Strong Coding Skills: Proficiency in Python (for the ML ecosystem) and Go (for Kubernetes controllers/infrastructure tooling).
- GPU Experience: Hands-on practice with CUDA environments, GPU virtualization/containerization, and doing it all within Kubernetes.
- Cloud Provider Experience: Familiarity with both managed ML offerings (Vertex AI, SageMaker, etc.) and building custom ML components in AWS and/or GCP.
- Experience working with distributed training frameworks such as Ray, and with running distributed training on Kubernetes.
- Comfortable with distributed systems, big data (petabyte scale), and data-intensive systems.
- Strong focus on scalability, reliability, performance, and ease of use. You are an undying advocate for platform users and have a deep intuition for the machine learning development lifecycle.
- Strong organizational & communication skills.
Benefits
- Comprehensive Healthcare Benefits and Income Replacement Programs
- 401k Match
- Family Planning Support
- Gender-Affirming Care
- Mental Health & Coaching Benefits
- Flexible Vacation & Reddit Global Days off
- Generous paid Parental Leave
- Paid Volunteer time off
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Kubernetes, Python, Go, CUDA, JupyterHub, JupyterLab, Ray, AWS, GCP, distributed training
Soft Skills
organizational skills, communication skills, user research, customer advocacy, scalability focus, reliability focus, performance focus, ease of use focus