Architect and implement high-throughput, deterministic data sampling systems capable of feeding distributed training clusters at frontier-model scale.
Design and execute dynamic curriculum learning strategies, creating systems that automatically adjust data distributions (text vs. multimodal) during training to improve model stability and reasoning capabilities.
Engineer the logic for serializing Reddit’s complex conversational trees (threads, subreddits, cross-posts) into optimal training contexts, developing topological data processing strategies that preserve semantic relationships for model understanding.
Formulate and validate statistical hypotheses regarding data mixtures, leveraging advanced sampling theory to minimize bias and maximize token quality.
Design the "Safety-First" ingestion layer: build automated pipelines for PII redaction, toxicity signals, and quality deduplication upstream of training, working closely with our Safety and Moderation Engineering counterparts.
Bridge the gap between research and engineering by translating theoretical sampling insights into robust, low-latency production infrastructure.
Mentor senior engineers and researchers on system design, numerical correctness, and performance optimization within distributed Python/Rust environments.

Requirements

8+ years of software engineering experience with a focus on machine learning infrastructure, data science at scale, or LLM pre-training.
Expert proficiency in Python and distributed data processing frameworks (e.g., Ray Data, Spark, or custom high-performance dataloaders).
Experience handling unstructured and semi-structured data at scale (not just tabular data), specifically text, code, images, and audio/video.
Strong mathematical foundation in probability, statistics, and importance sampling theory.
Deep understanding of pre-training dynamics and the impact of data quality/ordering on model performance.
Experience working with graph data structures or serializing conversation trees is highly valued.

Benefits

Comprehensive Healthcare Benefits and Income Replacement Programs
401k with Employer Match
Global Benefit programs that fit your lifestyle, from workspace to professional development to caregiving support
Family Planning Support
Gender-Affirming Care
Mental Health & Coaching Benefits
Flexible Vacation & Paid Volunteer Time Off
Generous Paid Parental Leave

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Python, distributed data processing, Ray Data, Spark, high-performance dataloaders, statistical hypothesis validation, sampling theory, data quality optimization, graph data structures, machine learning infrastructure

Soft Skills

mentoring, system design, numerical correctness, performance optimization, collaboration