
Staff Research Engineer – Pre-training Data
Reddit, Inc.
Full-time
Location Type: Remote
Location: United States
Salary
💰 $230,000 - $322,000 per year
About the role
- Architect and implement high-throughput, deterministic data sampling systems capable of feeding distributed training clusters at frontier-model scale.
- Design and execute dynamic curriculum learning strategies, creating systems that automatically adjust data distributions (text vs. multimodal) during training to improve model stability and reasoning capabilities.
- Engineer the logic for serializing Reddit’s complex conversational trees (threads, subreddits, cross-posts) into optimal training contexts, developing topological data processing strategies that preserve semantic relationships for model understanding.
- Formulate and validate statistical hypotheses regarding data mixtures, leveraging advanced sampling theory to minimize bias and maximize token quality.
- Design the "Safety-First" ingestion layer: Build automated pipelines for PII redaction, toxicity signals, and quality deduplication upstream of training, working closely with our Safety and Moderation Engineering counterparts.
- Bridge the gap between research and engineering by translating theoretical sampling insights into robust, low-latency production infrastructure.
- Mentor senior engineers and researchers on system design, numerical correctness, and performance optimization within distributed Python/Rust environments.
Requirements
- 8+ years of software engineering experience with a focus on machine learning infrastructure, data science at scale, or LLM pre-training.
- Expert proficiency in Python and distributed data processing frameworks (e.g., Ray Data, Spark, or custom high-performance dataloaders).
- Experience handling unstructured and semi-structured data at scale (not just tabular data), specifically text, code, images, and audio/video.
- Strong mathematical foundation in probability, statistics, and importance sampling theory.
- Deep understanding of pre-training dynamics and the impact of data quality/ordering on model performance.
- Experience working with graph data structures or serializing conversation trees is highly valued.
Benefits
- Comprehensive Healthcare Benefits and Income Replacement Programs
- 401k with Employer Match
- Global Benefit programs that fit your lifestyle, from workspace to professional development to caregiving support
- Family Planning Support
- Gender-Affirming Care
- Mental Health & Coaching Benefits
- Flexible Vacation & Paid Volunteer Time Off
- Generous Paid Parental Leave
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Python, distributed data processing, Ray Data, Spark, high-performance dataloaders, statistical hypothesis validation, sampling theory, data quality optimization, graph data structures, machine learning infrastructure
Soft Skills
mentoring, system design, numerical correctness, performance optimization, collaboration