Aleph Alpha

Senior AI Engineer – Pre-training Data

Full-time

Location Type: Hybrid

Location: Heidelberg, Germany

About the role

  • Co-own data pipelines end-to-end: Design, build, and maintain the infrastructure that sources, processes, deduplicates, filters, and prepares pre-training corpora at scale. Own the conversion from curated corpora to training-ready streaming formats.
  • Curate and compose data mixtures: Define and iterate on the data blends used for pre-training - balancing domains, languages, quality tiers, and licensing requirements to maximise model capability.
  • Build data quality tooling: Develop classifiers, heuristics, and analysis frameworks that measure and enforce data quality across terabyte-scale corpora. Monitor pipeline health and data quality metrics at scale.
  • Close data gaps: Work with evaluation and post-training teams to identify where model weaknesses trace back to data coverage, then source or generate the data needed to address them.
  • Collaborate with post-training: Partner closely with the post-training team to ensure pre-training data decisions support downstream fine-tuning, alignment, and deployment goals - data choices upstream shape what's possible downstream.
  • Co-own German-language data: Ensure deep, high-quality coverage of German-language corpora - this is core to our value proposition, not an afterthought.
  • Establish data-to-performance signal: Design and run ablation studies to validate data choices - measuring how changes in composition, filtering, or sourcing affect pre-training evaluation metrics and downstream capabilities.
  • Take data transparency seriously: Maintain data lineage and provenance so the team knows exactly what went into each training run.
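The deduplication and quality-filtering responsibilities above can be sketched in miniature. This is an illustrative sketch only - the document names, heuristics, and thresholds below are hypothetical, not Aleph Alpha's actual pipeline:

```python
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def exact_dedup(docs: list[str]) -> list[str]:
    """Drop documents whose normalized text has been seen before."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def passes_heuristics(doc: str, min_words: int = 5, max_symbol_ratio: float = 0.1) -> bool:
    """Toy quality heuristics: minimum length and limited symbol density."""
    if len(doc.split()) < min_words:
        return False
    symbols = sum(1 for c in doc if not (c.isalnum() or c.isspace()))
    return symbols / max(len(doc), 1) <= max_symbol_ratio

# Hypothetical mini-corpus exercising both filters.
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "the  quick brown FOX jumps over the lazy dog.",  # near-duplicate after normalization
    "buy now!!! $$$ ***",                             # too short, too symbol-heavy
    "Heidelberg is a city in Baden-Wuerttemberg, Germany.",
]
filtered = [d for d in exact_dedup(corpus) if passes_heuristics(d)]
```

Production pipelines replace the exact-hash step with fuzzy methods such as MinHash, but the stage ordering - normalize, deduplicate, then score - is the same shape.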

Requirements

  • Track record of shipping impactful technical work - whether that's research, infrastructure, or both.
  • Strong Python skills and comfort with data engineering and ML infrastructure, including experience with deep learning frameworks, workflow orchestration, object storage, columnar data formats, and distributed processing.
  • Ability to reason about what a dataset contributes to model training and whether it matters - not just process data, but understand it.
  • Ownership mentality: you see problems through from diagnosis to solution to deployment.
  • Willingness to relocate to Heidelberg or travel at least fortnightly.
  • Experience with large-scale data processing for ML, including corpus sourcing, curation, cleaning, deduplication, and filtering.
  • Familiarity with data quality methods: classifier-based filtering, heuristic scoring, perplexity-based selection, and decontamination.
  • Understanding of foundation model training - how data composition, scale, and mixing ratios affect capabilities.
  • Experience with web-scale data sourcing and crawl processing (e.g., Common Crawl, WARC pipelines).
  • Rust proficiency (parts of our data pipeline are performance-critical).
  • Infrastructure knowledge - experience with Kubernetes, container orchestration, or cloud-native ML infrastructure.
  • PhD in machine learning, NLP, data engineering, or a related field (valued but not required - we care about what you can do).
  • Bonus: German language proficiency is helpful for curating and assessing German-language data.
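Of the data quality methods listed above, decontamination is the most mechanical to illustrate: scan training documents for n-gram overlap with held-out evaluation sets. A minimal sketch, with a hypothetical eval item and a toy n-gram size (production decontamination uses real benchmark sets and longer n-grams):

```python
def ngrams(text: str, n: int) -> set:
    """Whitespace-tokenized n-grams, lowercased for case-insensitive matching."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

# Hypothetical evaluation item -- in practice, benchmark test sets.
eval_questions = ["What is the capital of France and when was it founded"]
N = 5  # toy n-gram size for this short example
eval_grams = set().union(*(ngrams(q, N) for q in eval_questions))

def is_contaminated(doc: str, banned: set, n: int) -> bool:
    """Flag a training document that shares any n-gram with the eval sets."""
    return bool(ngrams(doc, n) & banned)

doc_leaky = "Trivia dump: what is the capital of France and when was it founded, answers below."
doc_clean = "Paris has been the capital of France since the Middle Ages."
```

The clean document mentions the same topic but shares no 5-gram with the eval item, so topical overlap alone does not trigger removal - only verbatim leakage does.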

Benefits

  • 30 days of paid vacation
  • Access to a variety of fitness & wellness offerings via Wellhub
  • Mental health support through nilo.health
  • Substantially subsidized company pension plan for your future security
  • Subsidized Germany-wide transportation ticket
  • Budget for additional technical equipment
  • Flexible working hours for better work-life balance and hybrid working model
  • Virtual Stock Option Plan
  • JobRad® Bike Lease

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Python, data engineering, deep learning frameworks, workflow orchestration, object storage, columnar data formats, distributed processing, data quality methods, Rust, foundation model training
Soft Skills
ownership mentality, problem-solving, collaboration, critical thinking
Certifications
PhD in machine learning, PhD in NLP, PhD in data engineering