Aleph Alpha

AI Software Engineer – Model Evaluation

Full-time

Location Type: Hybrid

Location: Heidelberg, Germany


Job Level

Mid-Level, Senior

Tech Stack

Distributed Systems, PyTorch

About the role

  • As an AI Software Engineer in Model Evaluation, you will help design, implement, and scale the systems that measure our models’ performance at the cutting edge.
  • You will work closely with researchers to create evaluation benchmarks, datasets, and environments that test model capabilities, safety, and reliability across tasks from multilingual understanding to mathematical reasoning and creativity.
  • You will own significant portions of our evaluation infrastructure, including dataset generation pipelines, automated benchmarking tools, analysis dashboards, and large-scale evaluation orchestration on our compute clusters.
  • You’ll be building tools and experiments that drive product decisions, shape research priorities, and guide responsible deployment of our models.
  • This is high-scale, high-impact engineering: you’ll work with petabyte-scale data, run evaluations across large-scale distributed GPU clusters, and deliver insights that inform the direction of Aleph Alpha’s research.

Requirements

  • Bachelor’s degree in computer science, engineering, or a related field.
  • Willingness to work in Germany. Our primary working locations are Heidelberg (preferred) and Berlin, with some flexibility to work from other locations in Germany; regular travel to Heidelberg, potentially weekly, is expected.
  • Proficiency in programming and a passion for crafting high-quality, maintainable software while following engineering best practices (e.g., TDD, DDD).
  • Curiosity to dig deep into how models work and how to measure their capabilities.
  • Desire to take ownership of problems and collaborate with other teams to solve them.
  • Motivation to learn AI-related topics and get up-to-speed with the cutting edge.
  • Strong communication skills, with the ability to convey technical solutions to diverse audiences.
  • Master’s (or PhD) degree in computer science or related fields. (Preferred)
  • Familiarity with evaluation and benchmarking frameworks for AI models. (Preferred)
  • Experience working with distributed systems for large-scale data processing or evaluation orchestration. (Preferred)
  • Experience in dataset creation, annotation, and curation for complex AI tasks. (Preferred)
  • Familiarity with LLM architectures, popular NLP tools (e.g., PyTorch, HF Transformers), and automated evaluation techniques (e.g., LLM-as-a-judge, multi-turn evaluation). (Preferred)
  • Experience designing evaluations for safety, trustworthiness, and bias in AI systems. (Preferred)
  • Strong skills in data visualization, dashboarding, and reporting for evaluation results. (Preferred)
  • Familiarity with cluster management systems, model/data lineage, and MLOps workflows. (Preferred)

Benefits

  • 30 days of paid vacation
  • Access to a variety of fitness & wellness offerings via Wellhub
  • Mental health support through nilo.health
  • Substantially subsidized company pension plan for your future security
  • Subsidized Germany-wide transportation ticket
  • Budget for additional technical equipment
  • Flexible working hours for better work-life balance and hybrid working model
  • Virtual Stock Option Plan
  • JobRad® Bike Lease

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
programming, TDD, DDD, evaluation frameworks, benchmarking frameworks, distributed systems, dataset creation, data visualization, MLOps workflows, LLM architectures
Soft skills
communication, curiosity, ownership, collaboration, motivation
Certifications
Bachelor’s degree, Master’s degree, PhD