
Senior AI Engineer – Pre-training Data
Aleph Alpha
Full-time
Location Type: Hybrid
Location: Heidelberg • Germany
About the role
- Co-own data pipelines end-to-end: Design, build, and maintain the infrastructure that sources, processes, deduplicates, filters, and prepares pre-training corpora at scale. Own the conversion from curated corpora to training-ready streaming formats.
- Curate and compose data mixtures: Define and iterate on the data blends used for pre-training - balancing domains, languages, quality tiers, and licensing requirements to maximise model capability.
- Build data quality tooling: Develop classifiers, heuristics, and analysis frameworks that measure and enforce data quality across terabyte-scale corpora. Monitor pipeline health and data quality metrics at scale.
- Close data gaps: Work with evaluation and post-training teams to identify where model weaknesses trace back to data coverage, then source or generate the data needed to address them.
- Collaborate with post-training: Partner closely with the post-training team to ensure pre-training data decisions support downstream fine-tuning, alignment, and deployment goals - data choices upstream shape what's possible downstream.
- Co-own German-language data: Ensure deep, high-quality coverage of German-language corpora - this is core to our value proposition, not an afterthought.
- Establish data-to-performance signal: Design and run ablation studies to validate data choices - measuring how changes in composition, filtering, or sourcing affect pre-training evaluation metrics and downstream capabilities.
- Take data transparency seriously: Maintain data lineage and provenance so the team knows exactly what went into each training run.
Requirements
- Track record of shipping impactful technical work - whether that's research, infrastructure, or both.
- Strong Python skills and comfort with data engineering and ML infrastructure, including experience with deep learning frameworks, workflow orchestration, object storage, columnar data formats, and distributed processing.
- Ability to reason about what a dataset contributes to model training and whether it matters - not just process data, but understand it.
- Ownership mentality: you see problems through from diagnosis to solution to deployment.
- Willingness to relocate to Heidelberg or travel at least fortnightly.
- Experience with large-scale data processing for ML, including corpus sourcing, curation, cleaning, deduplication, and filtering.
- Familiarity with data quality methods: classifier-based filtering, heuristic scoring, perplexity-based selection, and decontamination.
- Understanding of foundation model training - how data composition, scale, and mixing ratios affect capabilities.
- Experience with web-scale data sourcing and crawl processing (e.g., Common Crawl, WARC pipelines).
- Rust proficiency (parts of our data pipeline are performance-critical).
- Infrastructure knowledge - experience with Kubernetes, container orchestration, or cloud-native ML infrastructure.
- PhD in machine learning, NLP, data engineering, or a related field (valued but not required - we care about what you can do).
- Bonus, but not required: German language proficiency can be helpful for curating and assessing German-language data.
Benefits
- 30 days of paid vacation
- Access to a variety of fitness & wellness offerings via Wellhub
- Mental health support through nilo.health
- Substantially subsidized company pension plan for your future security
- Subsidized Germany-wide transportation ticket
- Budget for additional technical equipment
- Flexible working hours for better work-life balance and hybrid working model
- Virtual Stock Option Plan
- JobRad® Bike Lease
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Python, data engineering, deep learning frameworks, workflow orchestration, object storage, columnar data formats, distributed processing, data quality methods, Rust, foundation model training
Soft Skills
ownership mentality, problem-solving, collaboration, critical thinking
Certifications
PhD in machine learning, PhD in NLP, PhD in data engineering