turbalance

AI Trace Generation Engineer

Full-time

Location Type: Hybrid

Location: Heidelberg, Germany

About the role

  • Design and implement a trace collection system for distributed LLM workloads
  • Validate that collected traces accurately reflect real workload behavior
  • Integrate with and instrument major LLM frameworks to extract meaningful execution data
  • Use collected traces as input to discrete event simulations
  • Analyze trace data to surface bottlenecks and inefficiencies across the stack

Requirements

  • 3+ years of experience in AI systems, ML infrastructure, or a closely related area
  • Hands-on experience with at least one major LLM serving or training framework
  • Strong proficiency in Python and C++
  • Solid understanding of GPU architecture, memory bandwidth, and the difference between compute-bound and memory-bound operations
  • Solid understanding of distributed communication
  • Familiarity with parallelism strategies and how they shape execution behavior across large clusters
  • Open source contributions or published research in relevant areas are a plus
  • Previous startup experience is a plus

Benefits

  • Competitive compensation with a performance-based incentive
  • Subsidized Deutschlandticket
  • Access to a discount portal
  • Flexible hours with hybrid and remote-friendly options
  • Relocation support