
Training Infrastructure Engineer
Mirelo AI
Full-time
Location Type: Hybrid
Location: Berlin • Germany
About the role
- Focus on the full training stack: profiling GPU behavior and debugging training pipelines
- Improve throughput by choosing the right parallelism strategies
- Design the infrastructure for efficient model training at scale
- Work across cluster management, model training, efficient data pipelines, inference, and PyTorch code optimization
Requirements
- Hands-on familiarity with the latest and most effective techniques for optimizing training and inference workloads, gained from implementing them rather than from reading papers
- Deep understanding of GPU memory hierarchy and computation capabilities
- Experience optimizing for both memory-bound and compute-bound operations
- Expertise with efficient attention algorithms and their performance characteristics at different scales
- Nice to have: experience implementing custom GPU kernels and integrating them into PyTorch
- Familiarity with high-performance storage solutions and understanding of their performance characteristics for ML workloads
- Experience with managing SLURM clusters at scale
Benefits
- Competitive compensation and equity
- True ownership from day one
- Join at a pivotal moment
- Build for the next generation of creators