d-Matrix

Machine Learning Intern – Dynamic KV-Cache Modeling for Efficient LLM Inference

Employment Type: Internship

Location Type: Hybrid

Location: Santa Clara, California, United States

Salary

$30 – $59 per hour

About the role

  • Research and analyze existing KV-Cache implementations used in LLM inference, particularly those that store past keys and values as lists of PyTorch tensors (first sketch after this list).
  • Investigate “Paged Attention” mechanisms that leverage dedicated CUDA data structures to optimize memory use for variable sequence lengths (second sketch below).
  • Design and implement a torch-native dynamic KV-Cache model that integrates seamlessly with PyTorch.
  • Model KV-Cache behavior within the PyTorch compute graph to improve compatibility with torch.compile and make the graph exportable (third sketch below).
  • Conduct experiments to evaluate memory utilization and inference efficiency on d-Matrix hardware.
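
For context, a minimal sketch of the "legacy" KV-Cache layout the first bullet refers to: one (key, value) tensor pair per decoder layer, grown by concatenation each decode step. Shapes and names here are illustrative assumptions, not d-Matrix code.

```python
import torch

num_layers, batch, heads, head_dim = 2, 1, 4, 64

# One (K, V) pair per layer; the sequence dimension (dim 2) starts empty
# and grows by one token per decode step.
past_key_values = [
    (torch.zeros(batch, heads, 0, head_dim),
     torch.zeros(batch, heads, 0, head_dim))
    for _ in range(num_layers)
]

def append_kv(layer, k_new, v_new):
    """Concatenate this step's keys/values onto one layer's cache.

    torch.cat allocates a fresh tensor on every call, so memory traffic
    and fragmentation grow with sequence length -- the motivation for
    paged and pre-allocated caches."""
    k, v = past_key_values[layer]
    past_key_values[layer] = (torch.cat([k, k_new], dim=2),
                              torch.cat([v, v_new], dim=2))

# One decode step: each layer produces K/V for a single new token.
for layer in range(num_layers):
    append_kv(layer,
              torch.randn(batch, heads, 1, head_dim),
              torch.randn(batch, heads, 1, head_dim))
```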
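The paged approach in the second bullet replaces per-sequence contiguous buffers with fixed-size blocks drawn from a shared pool, plus a per-sequence page table mapping logical token positions to physical blocks. Production systems do the lookup inside custom CUDA kernels; this pure-Python sketch only shows the bookkeeping, and every name in it is hypothetical.

```python
import torch

BLOCK_SIZE = 16                                      # tokens per physical block
heads, head_dim = 4, 64
pool = torch.zeros(64, BLOCK_SIZE, heads, head_dim)  # shared pool of blocks
free_blocks = list(range(pool.shape[0]))             # ids of unused blocks
page_table = {}                                      # seq_id -> [physical block ids]

def write_token(seq_id, pos, kv):
    """Store one token's key (or value) for `seq_id` at logical position `pos`."""
    blocks = page_table.setdefault(seq_id, [])
    logical_block, offset = divmod(pos, BLOCK_SIZE)
    if logical_block == len(blocks):     # sequence grew past its last block
        blocks.append(free_blocks.pop()) # grab one block; no large realloc
    pool[blocks[logical_block], offset] = kv

write_token(seq_id=0, pos=0, kv=torch.randn(heads, head_dim))
```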
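And a hedged sketch in the direction of the third and fourth bullets: a torch-native cache backed by a fixed, pre-allocated buffer updated in place, so shapes stay static and the update remains inside the compute graph, which tends to play well with torch.compile and graph export. Class and parameter names are assumptions for illustration.

```python
import torch

class PreallocatedKVCache(torch.nn.Module):
    """Fixed-capacity per-layer cache updated in place with index_copy_."""

    def __init__(self, batch, heads, max_seq, head_dim):
        super().__init__()
        # Buffers (not Parameters): they follow .to()/device moves and are
        # traced as graph state rather than baked-in constants.
        self.register_buffer("k", torch.zeros(batch, heads, max_seq, head_dim))
        self.register_buffer("v", torch.zeros(batch, heads, max_seq, head_dim))

    def update(self, k_new, v_new, pos):
        """Write this step's K/V at `pos` without reallocating, then
        return views over the valid prefix for attention."""
        seq_len = k_new.shape[2]
        idx = torch.arange(pos, pos + seq_len, device=k_new.device)
        self.k.index_copy_(2, idx, k_new)
        self.v.index_copy_(2, idx, v_new)
        end = pos + seq_len
        return self.k[:, :, :end], self.v[:, :, :end]

cache = PreallocatedKVCache(batch=1, heads=4, max_seq=128, head_dim=64)
k_all, v_all = cache.update(torch.randn(1, 4, 1, 64),
                            torch.randn(1, 4, 1, 64), pos=0)
```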

Requirements

  • Currently pursuing a degree in Computer Science, Electrical Engineering, Machine Learning, or a related field.
  • Familiarity with PyTorch and deep learning concepts, particularly regarding model optimization and memory management.
  • Understanding of hardware-accelerated computation; hands-on CUDA programming experience is a plus.
  • Strong programming skills in Python, with experience in PyTorch.
  • Analytical mindset with the ability to approach problems creatively.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
PyTorch, CUDA, Python, deep learning, model optimization, memory management, KV-Cache, Paged Attention, d-Matrix hardware, torch.compile
Soft Skills
analytical mindset, creative problem solving