
Machine Learning Intern – Dynamic KV-Cache Modeling for Efficient LLM Inference
d-Matrix
Job Type: Internship
Location Type: Hybrid
Location: Santa Clara • California • United States
Salary: $30 – $59 per hour
About the role
- Research and analyze existing KV-Cache implementations used in LLM inference, particularly those that store past key/value states as lists of PyTorch tensors (see the first sketch after this list).
- Investigate “Paged Attention” mechanisms that leverage dedicated CUDA data structures to optimize memory use across variable sequence lengths (second sketch below).
- Design and implement a torch-native dynamic KV-Cache module that integrates seamlessly with PyTorch.
- Model KV-Cache behavior inside the PyTorch compute graph to improve compatibility with torch.compile and simplify graph export (third sketch below).
- Conduct experiments to evaluate memory utilization and inference efficiency on d-Matrix hardware.
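
First sketch: the list-of-tensors cache pattern referenced above, in minimal form (illustrative names, not d-Matrix code). Each decoder layer holds a (key, value) pair shaped [batch, heads, seq_len, head_dim], and every decode step concatenates the new token's K/V, reallocating memory as the sequence grows.

```python
import torch

def append_kv(past_key_values, new_keys, new_values):
    """Concatenate one decode step of K/V per layer; torch.cat reallocates each call."""
    updated = []
    for (k, v), nk, nv in zip(past_key_values, new_keys, new_values):
        updated.append((torch.cat([k, nk], dim=2), torch.cat([v, nv], dim=2)))
    return updated

batch, heads, head_dim, layers = 1, 8, 64, 2
past = [(torch.zeros(batch, heads, 0, head_dim),
         torch.zeros(batch, heads, 0, head_dim)) for _ in range(layers)]
new_k = [torch.randn(batch, heads, 1, head_dim) for _ in range(layers)]
new_v = [torch.randn(batch, heads, 1, head_dim) for _ in range(layers)]
past = append_kv(past, new_k, new_v)
print(past[0][0].shape)  # torch.Size([1, 8, 1, 64]), grows by one token per step
```

The per-step reallocation and ever-changing seq_len dimension are what make this baseline awkward for static-graph capture.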
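Second sketch: the block-table indirection behind Paged Attention, reduced to pure PyTorch (the production kernels, e.g. in vLLM, are CUDA; names and sizes here are assumptions). Fixed-size physical blocks are pooled, and a per-sequence table maps logical block indices to physical ones, so sequences of very different lengths share one pool without per-sequence max-length padding.

```python
import torch

BLOCK_SIZE = 16
num_blocks, heads, head_dim = 32, 8, 64
key_pool = torch.zeros(num_blocks, BLOCK_SIZE, heads, head_dim)  # physical blocks

def write_key(block_table, pos, key, free_blocks):
    """Store one token's key at logical position `pos` for one sequence."""
    logical_block = pos // BLOCK_SIZE
    if logical_block == len(block_table):      # sequence grew into a new block
        block_table.append(free_blocks.pop())  # grab any free physical block
    physical = block_table[logical_block]
    key_pool[physical, pos % BLOCK_SIZE] = key

free = list(range(num_blocks))
table = []  # per-sequence block table: logical block -> physical block
for pos in range(20):  # 20 tokens span two 16-token blocks
    write_key(table, pos, torch.randn(heads, head_dim), free)
print(table)  # e.g. [31, 30]: two physical blocks back the first 20 tokens
```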
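Third sketch: one hypothetical shape for a torch-native dynamic cache (an assumption about the approach, not the actual deliverable). Pre-allocating K/V buffers to a maximum length and updating them in place with index_copy_ keeps the cache update an ordinary tensor op with static shapes inside the compute graph, which tends to be friendlier to torch.compile and graph export than Python-side list mutation.

```python
import torch

class StaticKVCache(torch.nn.Module):
    """Hypothetical pre-allocated KV-Cache; buffers keep a fixed shape."""

    def __init__(self, batch, heads, max_len, head_dim):
        super().__init__()
        self.register_buffer("k", torch.zeros(batch, heads, max_len, head_dim))
        self.register_buffer("v", torch.zeros(batch, heads, max_len, head_dim))

    def update(self, pos, new_k, new_v):
        # pos: 1-D LongTensor of sequence positions written this step.
        self.k.index_copy_(2, pos, new_k)
        self.v.index_copy_(2, pos, new_v)
        return self.k, self.v

cache = StaticKVCache(batch=1, heads=8, max_len=256, head_dim=64)
step = torch.tensor([5])
k, v = cache.update(step, torch.randn(1, 8, 1, 64), torch.randn(1, 8, 1, 64))
print(k.shape)  # torch.Size([1, 8, 256, 64]), shape stays static across steps
```

Because the buffers never change shape, repeated update calls avoid the recompilation that a growing seq_len dimension can trigger under torch.compile.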
Requirements
- Currently pursuing a degree in Computer Science, Electrical Engineering, Machine Learning, or a related field.
- Familiarity with PyTorch and deep learning concepts, particularly regarding model optimization and memory management.
- Understanding of CUDA programming and hardware-accelerated computation (hands-on CUDA experience is a plus).
- Strong programming skills in Python, with experience in PyTorch.
- Analytical mindset with the ability to approach problems creatively.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
PyTorch, CUDA, Python, deep learning, model optimization, memory management, KV-Cache, Paged Attention, d-Matrix hardware, torch.compile
Soft Skills
analytical mindset, creative problem solving