Aleph Alpha

Senior AI Engineer – Pre-training Data

Full-time

Location Type: Hybrid

Location: Heidelberg, Germany

About the role

  • Co-own data pipelines end-to-end: Design, build, and maintain the infrastructure that sources, processes, deduplicates, filters, and prepares pre-training corpora at scale. Own the conversion from curated corpora to training-ready streaming formats.
  • Curate and compose data mixtures: Define and iterate on the data blends used for pre-training - balancing domains, languages, quality tiers, and licensing requirements to maximise model capability.
  • Build data quality tooling: Develop classifiers, heuristics, and analysis frameworks that measure and enforce data quality across terabyte-scale corpora. Monitor pipeline health and data quality metrics at scale.
  • Close data gaps: Work with evaluation and post-training teams to identify where model weaknesses trace back to data coverage, then source or generate the data needed to address them.
  • Collaborate with post-training: Partner closely with the post-training team to ensure pre-training data decisions support downstream fine-tuning, alignment, and deployment goals - data choices upstream shape what's possible downstream.
  • Co-own German-language data: Ensure deep, high-quality coverage of German-language corpora - this is core to our value proposition, not an afterthought.
  • Establish data-to-performance signal: Design and run ablation studies to validate data choices - measuring how changes in composition, filtering, or sourcing affect pre-training evaluation metrics and downstream capabilities.
  • Take data transparency seriously: Maintain data lineage and provenance so the team knows exactly what went into each training run.
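The deduplication and quality-filtering responsibilities above can be sketched in miniature. This is an illustrative sketch only - the document names, heuristics, and thresholds below are hypothetical, not Aleph Alpha's actual pipeline:

```python
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def exact_dedup(docs: list[str]) -> list[str]:
    """Drop documents whose normalized text has been seen before."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def passes_heuristics(doc: str, min_words: int = 5, max_symbol_ratio: float = 0.1) -> bool:
    """Toy quality heuristics: minimum length and limited symbol density."""
    if len(doc.split()) < min_words:
        return False
    symbols = sum(1 for c in doc if not (c.isalnum() or c.isspace()))
    return symbols / max(len(doc), 1) <= max_symbol_ratio

# Hypothetical mini-corpus exercising both filters.
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "the  quick brown FOX jumps over the lazy dog.",  # near-duplicate after normalization
    "buy now!!! $$$ ***",                             # too short, too symbol-heavy
    "Heidelberg is a city in Baden-Wuerttemberg, Germany.",
]
filtered = [d for d in exact_dedup(corpus) if passes_heuristics(d)]
```

Production pipelines replace the exact-hash step with fuzzy methods such as MinHash, but the stage ordering - normalize, deduplicate, then score - is the same shape.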

Requirements

  • Track record of shipping impactful technical work - whether that's research, infrastructure, or both.
  • Strong Python skills and comfort with data engineering and ML infrastructure, including experience with deep learning frameworks, workflow orchestration, object storage, columnar data formats, and distributed processing.
  • Ability to reason about what a dataset contributes to model training and whether it matters - not just process data, but understand it.
  • Ownership mentality: you see problems through from diagnosis to solution to deployment.
  • Willingness to relocate to Heidelberg or travel at least fortnightly.
  • Experience with large-scale data processing for ML, including corpus sourcing, curation, cleaning, deduplication, and filtering.
  • Familiarity with data quality methods: classifier-based filtering, heuristic scoring, perplexity-based selection, and decontamination.
  • Understanding of foundation model training - how data composition, scale, and mixing ratios affect capabilities.
  • Experience with web-scale data sourcing and crawl processing (e.g., Common Crawl, WARC pipelines).
  • Rust proficiency (parts of our data pipeline are performance-critical).
  • Infrastructure knowledge - experience with Kubernetes, container orchestration, or cloud-native ML infrastructure.
  • PhD in machine learning, NLP, data engineering, or a related field (valued but not required - we care about what you can do).
  • Bonus: German language proficiency is helpful for curating and assessing German-language data.
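Of the data quality methods listed above, decontamination is the most mechanical to illustrate: scan training documents for n-gram overlap with held-out evaluation sets. A minimal sketch, with a hypothetical eval item and a toy n-gram size (production decontamination uses real benchmark sets and longer n-grams):

```python
def ngrams(text: str, n: int) -> set:
    """Whitespace-tokenized n-grams, lowercased for case-insensitive matching."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

# Hypothetical evaluation item -- in practice, benchmark test sets.
eval_questions = ["What is the capital of France and when was it founded"]
N = 5  # toy n-gram size for this short example
eval_grams = set().union(*(ngrams(q, N) for q in eval_questions))

def is_contaminated(doc: str, banned: set, n: int) -> bool:
    """Flag a training document that shares any n-gram with the eval sets."""
    return bool(ngrams(doc, n) & banned)

doc_leaky = "Trivia dump: what is the capital of France and when was it founded, answers below."
doc_clean = "Paris has been the capital of France since the Middle Ages."
```

The clean document mentions the same topic but shares no 5-gram with the eval item, so topical overlap alone does not trigger removal - only verbatim leakage does.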

Benefits

  • 30 days of paid vacation
  • Access to a variety of fitness & wellness offerings via Wellhub
  • Mental health support through nilo.health
  • Substantially subsidized company pension plan for your future security
  • Subsidized Germany-wide transportation ticket
  • Budget for additional technical equipment
  • Flexible working hours for better work-life balance and hybrid working model
  • Virtual Stock Option Plan
  • JobRad® Bike Lease

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Python, data engineering, deep learning frameworks, workflow orchestration, object storage, columnar data formats, distributed processing, data quality methods, Rust, foundation model training
Soft Skills
ownership mentality, problem-solving, collaboration, critical thinking
Certifications
PhD in machine learning, PhD in NLP, PhD in data engineering