
AI Trace Generation Engineer
turbalance
full-time
Location Type: Hybrid
Location: Heidelberg • Germany
About the role
- Design and implement a trace collection system for distributed LLM workloads
- Validate that collected traces accurately reflect real workload behavior
- Integrate with and instrument major LLM frameworks to extract meaningful execution data
- Use collected traces as input to discrete event simulations
- Analyze trace data to surface bottlenecks and inefficiencies across the stack
Requirements
- 3+ years of experience in AI systems, ML infrastructure, or a closely related area
- Hands-on experience with at least one major LLM serving or training framework
- Strong proficiency in Python and C++
- Solid understanding of GPU architecture, memory bandwidth, and the difference between compute-bound and memory-bound operations
- Solid understanding of distributed communication
- Familiarity with parallelism strategies and how they shape execution behavior across large clusters
- Open-source contributions or published research in relevant areas are a strong plus
- Previous startup experience is a plus
Benefits
- Competitive compensation with a performance-based incentive
- Subsidized Deutschlandticket
- Access to a discount portal
- Flexible hours with hybrid and remote-friendly options
- Relocation support
Applicant Tracking System Keywords
Hard Skills & Tools
Python, C++, LLM frameworks, GPU architecture, memory bandwidth, compute-bound operations, memory-bound operations, distributed communication, parallelism strategies, discrete event simulations