
Research Engineer – Distributed Training
CloudWalk, Inc.
Full-time
Location Type: Remote
Location: Remote • 🇧🇷 Brazil
Job Level
Mid-Level, Senior
Tech Stack
Kubernetes, Node.js, PyTorch
About the role
- Design, implement, and maintain CloudWalk’s distributed LLM training pipeline.
- Orchestrate multi-node, multi-GPU runs across Kubernetes and internal clusters (see the launch sketch after this list).
- Optimize performance, memory, and cost across large training workloads.
- Integrate cutting-edge frameworks (Unsloth, TorchTitan, Axolotl) into production workflows.
- Build internal tools and templates that accelerate research-to-production transitions.
- Collaborate with infra, research, and MLOps teams to ensure reliability and reproducibility.
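For a sense of scale, a minimal sketch of what a multi-node, multi-GPU entry point for such a pipeline might look like, assuming a `torchrun` launcher (for example, one Kubernetes pod per node). The node counts, rendezvous endpoint, and script name are illustrative placeholders, not CloudWalk's actual setup:

```python
# Hypothetical launch, e.g. from a Kubernetes Job running one pod per node:
#   torchrun --nnodes=4 --nproc_per_node=8 \
#            --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 train.py
import os

import torch
import torch.distributed as dist

def main():
    # torchrun populates RANK, LOCAL_RANK, and WORLD_SIZE for every worker.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # ... build the model and dataloader here, then wrap the model
    # (DDP/FSDP) and run the training loop.
    if dist.get_rank() == 0:
        print(f"training across {dist.get_world_size()} GPUs")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```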
Requirements
- Strong background in **PyTorch** and **distributed training** (DeepSpeed, FSDP, Accelerate); a minimal FSDP sketch follows this list.
- Hands-on experience with large-scale multi-GPU or multi-node training.
- Familiarity with **Transformers**, **Datasets**, and **mixed-precision techniques**.
- Understanding of **GPUs, containers, and schedulers** (Kubernetes, Slurm).
- Mindset for reliability, performance, and clean engineering.
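As a rough illustration of the experience these requirements describe, here is a minimal PyTorch FSDP setup with bf16 mixed precision. The checkpoint name and learning rate are assumptions for the example; a real pipeline would add wrapping policies, activation checkpointing, and a distributed dataloader:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision
from transformers import AutoModelForCausalLM

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Placeholder checkpoint; any causal LM from the Hub works the same way.
model = AutoModelForCausalLM.from_pretrained("gpt2").cuda()

# bf16 for parameters, gradient reductions, and buffers.
mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

# Shard parameters, gradients, and optimizer state across all ranks.
model = FSDP(model, mixed_precision=mp_policy)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
# ... standard loop: forward pass, loss.backward(), optimizer.step()
```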
Benefits
- Competitive salary
- Equity
- Opportunity to shape future AI infrastructure
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
PyTorch, distributed training, DeepSpeed, FSDP, Accelerate, Transformers, Datasets, mixed-precision techniques, GPUs, multi-GPU training
Soft skills
reliability, performance, clean engineering