CloudWalk, Inc.

Research Engineer – Distributed Training

Full-time

Location Type: Remote

Location: Remote • 🇧🇷 Brazil

Job Level

Mid-Level, Senior

Tech Stack

Kubernetes, Node.js, PyTorch

About the role

  • Design, implement, and maintain CloudWalk’s distributed LLM training pipeline.
  • Orchestrate multi-node, multi-GPU runs across Kubernetes and internal clusters.
  • Optimize performance, memory, and cost across large training workloads.
  • Integrate cutting-edge frameworks (Unsloth, TorchTitan, Axolotl) into production workflows.
  • Build internal tools and templates that accelerate research-to-production transitions.
  • Collaborate with infra, research, and MLOps teams to ensure reliability and reproducibility.

Requirements

  • Strong background in **PyTorch** and **distributed training** (DeepSpeed, FSDP, Accelerate).
  • Hands-on experience with large-scale multi-GPU or multi-node training.
  • Familiarity with **Transformers**, **Datasets**, and **mixed-precision** techniques.
  • Understanding of **GPUs, containers, and schedulers** (Kubernetes, Slurm).
  • Mindset for reliability, performance, and clean engineering.

Benefits

  • Competitive salary
  • Equity
  • Opportunity to shape future AI infrastructure

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
PyTorch, distributed training, DeepSpeed, FSDP, Accelerate, Transformers, Datasets, mixed-precision techniques, GPUs, multi-GPU training
Soft skills
reliability, performance, clean engineering