
Senior Performance Engineer – Pretraining
Aleph Alpha
Full-time
Location Type: Hybrid
Location: Heidelberg, Germany
About the role
- Engineer the systems required to train foundation models at scale.
- Maximize hardware utilization and training throughput on our large-scale GPU clusters.
- Work at the intersection of deep learning frameworks, distributed systems, and GPU microarchitecture.
Requirements
- Proficiency in Python and the PyTorch library.
- A strong engineering background in parallel and/or distributed systems, with a proven track record of excellence.
- Hands-on experience with modern machine learning techniques, especially large language models and their life cycle.
- A deep understanding of the CUDA programming model.
- Experience in distributed programming with APIs such as NCCL or MPI.
- Experience analysing profiling traces with tools such as PyTorch Profiler and NVIDIA Nsight.

Please note that this role requires regular on-site collaboration in Heidelberg as a member of the Training Efficiency Team.
Benefits
- 30 days of paid vacation
- Access to a variety of fitness & wellness offerings via Wellhub
- Mental health support through nilo.health
- JobRad® Bike Lease
- Substantially subsidized company pension plan for your future financial security
- Subsidized Germany-wide transportation ticket
- Budget for additional technical equipment
- Flexible working hours for better work-life balance and hybrid working model
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Python, PyTorch, CUDA, NCCL, MPI, machine learning, large language models, parallel systems, distributed systems, profiling
Soft Skills
engineering background, collaboration, excellence