
Data Engineer
Aldea
Full-time
Location Type: Hybrid
Location: San Francisco • California • 🇺🇸 United States
Job Level
Mid-Level, Senior
Tech Stack
Distributed Systems, Node.js, Python, Ray, Spark
About the role
- Build and scale data pipelines for pretraining, midtraining, and post-training at trillion+ token scale across language and speech domains
- Process and curate large-scale datasets including cleaning, deduplication, quality filtering, and optimization for distributed training
- Generate synthetic data for model training and evaluation across diverse tasks and domains
- Design efficient data loading systems achieving high throughput across multi-node training clusters
- Build data versioning and reproducibility systems to track dataset compositions and enable reproducible experiments
- Collaborate with ML engineers and researchers to optimize pipelines and improve data quality
Requirements
- Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience
- 3+ years of experience building large-scale data pipelines for machine learning or data-intensive applications
- Strong programming skills in Python and experience with data processing frameworks (Spark, Dask, Ray, or similar)
- Experience with data quality techniques including deduplication, filtering, and validation at scale
- Proven ability to optimize data pipelines for performance and throughput in distributed systems
- Experience working with large datasets (100 GB to 10 TB+) and understanding of storage systems and data formats
Benefits
- Competitive base salary
- Performance-based bonus aligned with research and model milestones
- Equity participation
- Comprehensive health, dental, and vision coverage
- Flexible paid time off
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
data pipelines, Python, Spark, Dask, Ray, data processing, data quality techniques, deduplication, filtering, validation
Soft skills
collaboration, optimization, problem-solving
Certifications
Bachelor's degree in Computer Science, Bachelor's degree in Engineering