
Research Engineer – Distributed Training
CloudWalk, Inc.
Full-time
Location Type: Remote
Location: Remote • 🇧🇷 Brazil
Job Level
Mid-Level, Senior
Tech Stack
Kubernetes, Node.js, PyTorch
About the role
- Design, implement, and maintain CloudWalk’s distributed LLM training pipeline.
- Orchestrate multi-node, multi-GPU runs across Kubernetes and internal clusters (see the launch sketch after this list).
- Optimize performance, memory, and cost across large training workloads.
- Integrate cutting-edge frameworks (Unsloth, TorchTitan, Axolotl) into production workflows.
- Build internal tools and templates that accelerate research-to-production transitions.
- Collaborate with infra, research, and MLOps teams to ensure reliability and reproducibility.
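For a sense of scale, a minimal sketch of what a multi-node, multi-GPU entry point for such a pipeline might look like, assuming a `torchrun` launcher (for example, one Kubernetes pod per node). The node counts, rendezvous endpoint, and script name are illustrative placeholders, not CloudWalk's actual setup:

```python
# Hypothetical launch, e.g. from a Kubernetes Job running one pod per node:
#   torchrun --nnodes=4 --nproc_per_node=8 \
#            --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 train.py
import os

import torch
import torch.distributed as dist

def main():
    # torchrun populates RANK, LOCAL_RANK, and WORLD_SIZE for every worker.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # ... build the model and dataloader here, then wrap the model
    # (DDP/FSDP) and run the training loop.
    if dist.get_rank() == 0:
        print(f"training across {dist.get_world_size()} GPUs")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```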
Requirements
- Strong background in **PyTorch** and **distributed training** (DeepSpeed, FSDP, Accelerate); a minimal FSDP sketch follows this list.
- Hands-on experience with large-scale multi-GPU or multi-node training.
- Familiarity with **Transformers**, **Datasets**, and **mixed-precision techniques**.
- Understanding of **GPUs, containers, and schedulers** (Kubernetes, Slurm).
- Mindset for reliability, performance, and clean engineering.
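As a rough illustration of the experience these requirements describe, here is a minimal PyTorch FSDP setup with bf16 mixed precision. The checkpoint name and learning rate are assumptions for the example; a real pipeline would add wrapping policies, activation checkpointing, and a distributed dataloader:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision
from transformers import AutoModelForCausalLM

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Placeholder checkpoint; any causal LM from the Hub works the same way.
model = AutoModelForCausalLM.from_pretrained("gpt2").cuda()

# bf16 for parameters, gradient reductions, and buffers.
mp_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

# Shard parameters, gradients, and optimizer state across all ranks.
model = FSDP(model, mixed_precision=mp_policy)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
# ... standard loop: forward pass, loss.backward(), optimizer.step()
```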
Benefits
- Competitive salary
- Equity
- Opportunity to shape future AI infrastructure
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
PyTorch, distributed training, DeepSpeed, FSDP, Accelerate, Transformers, Datasets, mixed-precision techniques, GPUs, multi-GPU training
Soft skills
reliability, performance, clean engineering