Aldea

Data Engineer

full-time

Location Type: Hybrid

Location: San Francisco • California • 🇺🇸 United States

Job Level

Mid-Level • Senior

Tech Stack

Distributed Systems • Node.js • Python • Ray • Spark

About the role

  • Build and scale data pipelines for pretraining, midtraining, and post-training at trillion+ token scale across language and speech domains
  • Process and curate large-scale datasets including cleaning, deduplication, quality filtering, and optimization for distributed training
  • Generate synthetic data for model training and evaluation across diverse tasks and domains
  • Design efficient data loading systems achieving high throughput across multi-node training clusters
  • Build data versioning and reproducibility systems to track dataset compositions and enable reproducible experiments
  • Collaborate with ML engineers and researchers to optimize pipelines and improve data quality

Requirements

  • Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience
  • 3+ years of experience building large-scale data pipelines for machine learning or data-intensive applications
  • Strong programming skills in Python and experience with data processing frameworks (Spark, Dask, Ray, or similar)
  • Experience with data quality techniques including deduplication, filtering, and validation at scale
  • Proven ability to optimize data pipelines for performance and throughput in distributed systems
  • Experience working with large datasets (100 GB to 10 TB+) and an understanding of storage systems and data formats

Benefits

  • Competitive base salary
  • Performance-based bonus aligned with research and model milestones
  • Equity participation
  • Comprehensive health, dental, and vision coverage
  • Flexible paid time off
