Reddit, Inc.

Senior Machine Learning Engineer, ML Training Platform

Reddit, Inc.

full-time

Posted on:

Location Type: Remote

Location: United States

Visit company website

Explore more

AI Apply
Apply

Salary

💰 $216,700 - $303,400 per year

Job Level

About the role

  • Lead the building, testing, and maintenance of ML training infrastructure at Reddit.
  • Play a pivotal role in designing, building, and optimizing the infrastructure and tooling required to support large-scale machine learning workflows.
  • Evolve the MLE experience, from provisioning interactive GPU environments through large-scale training, supporting on-demand and self-service workflows.
  • Kubernetes Automation: Write custom Kubernetes Controllers and Operators to manage the lifecycle of interactive Jupyter workspaces and long-running ML training jobs, handle auto-idling, and ensure fault tolerance.
  • GPU Orchestration: Work with the underlying compute team to ensure MLEs have efficient access to training hardware resources and handle resource contention gracefully.
  • Developer Experience (DevX): Treat internal MLEs as your customers. Conduct user research, reduce friction in the "Idea-to-Prototype" loop, and standardize software environments (Docker images, Python dependency management).

Requirements

  • 5+ years of software engineering experience, with a focus on Platform Engineering, ML Infrastructure, or Backend Systems.
  • Deep Kubernetes Expertise: You know K8s beyond just "deploying pods." You understand CRDs, Controllers and the Operator pattern.
  • Jupyter Ecosystem Knowledge: Experience customizing JupyterHub, JupyterLab extensions, or building similar interactive computing platforms.
  • Strong Coding Skills: Proficiency in Python (for the ML ecosystem) and Go (for Kubernetes controllers/infrastructure tooling).
  • GPU Experience: Hands-on practice with CUDA environments, GPU virtualization/containerization, and doing it all within Kubernetes.
  • Cloud Provider Experience: Familiarity with both managed ML offerings (Vertex AI, Sagemaker, etc) and building custom ML components in AWS and/or GCP.
  • Experience working with distributed training frameworks, including Ray and Kubernetes.
  • Comfortable with distributed systems, big data (Petabyte scale) and data-intensive systems.
  • Strong focus on scalability, reliability, performance, and ease of use. You are an undying advocate for platform users and have a deep intuition for the machine learning development lifecycle.
  • Strong organizational & communication skills.
Benefits
  • Comprehensive Healthcare Benefits and Income Replacement Programs
  • 401k Match
  • Family Planning Support
  • Gender-Affirming Care
  • Mental Health & Coaching Benefits
  • Flexible Vacation & Reddit Global Days off
  • Generous paid Parental Leave
  • Paid Volunteer time off
Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
KubernetesPythonGoCUDAJupyterHubJupyterLabRayAWSGCPdistributed training
Soft Skills
organizational skillscommunication skillsuser researchcustomer advocacyscalability focusreliability focusperformance focusease of use focus