
Machine Learning Intern – Dynamic KV-Cache Modeling for Efficient LLM Inference
d-Matrix
Job Type: Internship
Location Type: Hybrid
Location: Santa Clara • California • United States
Salary: $30 – $59 per hour
About the role
- Research and analyze existing KV-Cache implementations used in LLM inference, particularly those that store past key/value states as lists of PyTorch tensors (see the first sketch after this list).
- Investigate “Paged Attention” mechanisms that leverage dedicated CUDA data structures to optimize memory use across variable sequence lengths (second sketch below).
- Design and implement a torch-native dynamic KV-Cache module that integrates seamlessly with PyTorch.
- Model KV-Cache behavior inside the PyTorch compute graph to improve compatibility with torch.compile and simplify graph export (third sketch below).
- Conduct experiments to evaluate memory utilization and inference efficiency on d-Matrix hardware.
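
First sketch: the list-of-tensors cache pattern referenced above, in minimal form (illustrative names, not d-Matrix code). Each decoder layer holds a (key, value) pair shaped [batch, heads, seq_len, head_dim], and every decode step concatenates the new token's K/V, reallocating memory as the sequence grows.

```python
import torch

def append_kv(past_key_values, new_keys, new_values):
    """Concatenate one decode step of K/V per layer; torch.cat reallocates each call."""
    updated = []
    for (k, v), nk, nv in zip(past_key_values, new_keys, new_values):
        updated.append((torch.cat([k, nk], dim=2), torch.cat([v, nv], dim=2)))
    return updated

batch, heads, head_dim, layers = 1, 8, 64, 2
past = [(torch.zeros(batch, heads, 0, head_dim),
         torch.zeros(batch, heads, 0, head_dim)) for _ in range(layers)]
new_k = [torch.randn(batch, heads, 1, head_dim) for _ in range(layers)]
new_v = [torch.randn(batch, heads, 1, head_dim) for _ in range(layers)]
past = append_kv(past, new_k, new_v)
print(past[0][0].shape)  # torch.Size([1, 8, 1, 64]), grows by one token per step
```

The per-step reallocation and ever-changing seq_len dimension are what make this baseline awkward for static-graph capture.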
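Second sketch: the block-table indirection behind Paged Attention, reduced to pure PyTorch (the production kernels, e.g. in vLLM, are CUDA; names and sizes here are assumptions). Fixed-size physical blocks are pooled, and a per-sequence table maps logical block indices to physical ones, so sequences of very different lengths share one pool without per-sequence max-length padding.

```python
import torch

BLOCK_SIZE = 16
num_blocks, heads, head_dim = 32, 8, 64
key_pool = torch.zeros(num_blocks, BLOCK_SIZE, heads, head_dim)  # physical blocks

def write_key(block_table, pos, key, free_blocks):
    """Store one token's key at logical position `pos` for one sequence."""
    logical_block = pos // BLOCK_SIZE
    if logical_block == len(block_table):      # sequence grew into a new block
        block_table.append(free_blocks.pop())  # grab any free physical block
    physical = block_table[logical_block]
    key_pool[physical, pos % BLOCK_SIZE] = key

free = list(range(num_blocks))
table = []  # per-sequence block table: logical block -> physical block
for pos in range(20):  # 20 tokens span two 16-token blocks
    write_key(table, pos, torch.randn(heads, head_dim), free)
print(table)  # e.g. [31, 30]: two physical blocks back the first 20 tokens
```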
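Third sketch: one hypothetical shape for a torch-native dynamic cache (an assumption about the approach, not the actual deliverable). Pre-allocating K/V buffers to a maximum length and updating them in place with index_copy_ keeps the cache update an ordinary tensor op with static shapes inside the compute graph, which tends to be friendlier to torch.compile and graph export than Python-side list mutation.

```python
import torch

class StaticKVCache(torch.nn.Module):
    """Hypothetical pre-allocated KV-Cache; buffers keep a fixed shape."""

    def __init__(self, batch, heads, max_len, head_dim):
        super().__init__()
        self.register_buffer("k", torch.zeros(batch, heads, max_len, head_dim))
        self.register_buffer("v", torch.zeros(batch, heads, max_len, head_dim))

    def update(self, pos, new_k, new_v):
        # pos: 1-D LongTensor of sequence positions written this step.
        self.k.index_copy_(2, pos, new_k)
        self.v.index_copy_(2, pos, new_v)
        return self.k, self.v

cache = StaticKVCache(batch=1, heads=8, max_len=256, head_dim=64)
step = torch.tensor([5])
k, v = cache.update(step, torch.randn(1, 8, 1, 64), torch.randn(1, 8, 1, 64))
print(k.shape)  # torch.Size([1, 8, 256, 64]), shape stays static across steps
```

Because the buffers never change shape, repeated update calls avoid the recompilation that a growing seq_len dimension can trigger under torch.compile.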
Requirements
- Currently pursuing a degree in Computer Science, Electrical Engineering, Machine Learning, or a related field.
- Familiarity with PyTorch and deep learning concepts, particularly regarding model optimization and memory management.
- Understanding of CUDA programming and hardware-accelerated computation (hands-on CUDA experience is a plus).
- Strong programming skills in Python, with experience in PyTorch.
- Analytical mindset with the ability to approach problems creatively.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
PyTorch, CUDA, Python, deep learning, model optimization, memory management, KV-Cache, Paged Attention, d-Matrix hardware, torch.compile
Soft Skills
analytical mindset, creative problem solving