Aleph Alpha

Senior Performance Engineer – Pretraining

full-time

Location Type: Hybrid

Location: Heidelberg, Germany

About the role

  • Engineer the systems required to train foundation models at scale.
  • Maximize hardware utilization and training throughput on our large-scale GPU clusters.
  • Work at the intersection of deep learning frameworks, distributed systems, and GPU microarchitecture.

Requirements

  • Proficiency in Python and the PyTorch library.
  • A strong engineering background in parallel and/or distributed systems, with a proven track record of excellence.
  • Hands-on experience with modern machine learning techniques, especially large language models and their life cycle.
  • A deep understanding of the CUDA programming model.
  • Experience in distributed programming with APIs such as NCCL or MPI.
  • Experience analysing profiling traces with tools such as PyTorch Profiler and NVIDIA Nsight.

Please note: this role requires regular on-site collaboration in Heidelberg as a member of the Training Efficiency Team.

Benefits

  • 30 days of paid vacation
  • Access to a variety of fitness & wellness offerings via Wellhub
  • Mental health support through nilo.health
  • JobRad® Bike Lease
  • Substantially subsidized company pension plan for your future security
  • Subsidized Germany-wide transportation ticket
  • Budget for additional technical equipment
  • Flexible working hours for better work-life balance and hybrid working model

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Python, PyTorch, CUDA, NCCL, MPI, machine learning, large language models, parallel systems, distributed systems, profiling

Soft Skills
engineering background, collaboration, excellence