Reddit, Inc.

Staff Research Engineer – Pre-training Data

full-time

Location Type: Remote

Location: United States

Salary

💰 $230,000 - $322,000 per year

About the role

  • Architect and implement high-throughput, deterministic data sampling systems capable of feeding distributed training clusters at frontier-model scale.
  • Design and execute dynamic curriculum learning strategies, creating systems that automatically adjust data distributions (text vs. multimodal) during training to improve model stability and reasoning capabilities.
  • Engineer the logic for serializing Reddit’s complex conversational trees (threads, subreddits, cross-posts) into optimal training contexts, developing topological data processing strategies that preserve semantic relationships for model understanding.
  • Formulate and validate statistical hypotheses regarding data mixtures, leveraging advanced sampling theory to minimize bias and maximize token quality.
  • Design the "Safety-First" ingestion layer: Build automated pipelines for PII redaction, toxicity signals, and quality deduplication upstream of training, working closely with our Safety and Moderation Engineering counterparts.
  • Bridge the gap between research and engineering by translating theoretical sampling insights into robust, low-latency production infrastructure.
  • Mentor senior engineers and researchers on system design, numerical correctness, and performance optimization within distributed Python/Rust environments.

Requirements

  • 8+ years of software engineering experience with a focus on machine learning infrastructure, data science at scale, or LLM pre-training.
  • Expert proficiency in Python and distributed data processing frameworks (e.g., Ray Data, Spark, or custom high-performance dataloaders).
  • Experience handling unstructured and semi-structured data at scale (not just tabular data), specifically text, code, images, and audio/video.
  • Strong mathematical foundation in probability, statistics, and importance sampling theory.
  • Deep understanding of pre-training dynamics and the impact of data quality/ordering on model performance.
  • Experience working with graph data structures or serializing conversation trees is highly valued.

Benefits

  • Comprehensive Healthcare Benefits and Income Replacement Programs
  • 401k with Employer Match
  • Global Benefit programs that fit your lifestyle, from workspace to professional development to caregiving support
  • Family Planning Support
  • Gender-Affirming Care
  • Mental Health & Coaching Benefits
  • Flexible Vacation & Paid Volunteer Time Off
  • Generous Paid Parental Leave

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Python, distributed data processing, Ray Data, Spark, high-performance dataloaders, statistical hypothesis validation, sampling theory, data quality optimization, graph data structures, machine learning infrastructure
Soft Skills
mentoring, system design, numerical correctness, performance optimization, collaboration