Prime Intellect

Research Engineer – RL Infrastructure

full-time

Location Type: Remote

Location: California, United States

About the role

  • Build and optimize the systems infrastructure behind large-scale RL and distributed training workloads.
  • Improve end-to-end training efficiency across compute, memory, networking, and scheduling layers.
  • Design and implement low-level performance optimizations, including kernels, communication paths, and runtime improvements.
  • Work on distributed training systems spanning data, tensor, and pipeline parallel workloads.
  • Help shape the architecture of our RL training stack, including async rollout and post-training systems.
  • Contribute to open-source libraries and internal infrastructure used for frontier-scale model training.
  • Collaborate closely with researchers and infrastructure engineers to translate bottlenecks into concrete systems improvements.
  • Stay at the frontier of training systems, inference systems, compiler/runtime tooling, and hardware-aware optimization techniques.

Requirements

  • Strong systems engineering experience in AI/ML infrastructure, especially around large-scale model training or inference.
  • Deep familiarity with PyTorch and the surrounding distributed training and inference tooling, such as PyTorch Distributed, FSDP, DeepSpeed, Megatron, vLLM, or Ray.
  • Experience optimizing training performance across kernels, memory movement, communication overhead, or parallelization strategy.
  • Hands-on experience with large-scale training techniques including data parallelism, tensor parallelism, and pipeline parallelism.
  • Strong understanding of GPU architecture, profiling, and performance debugging.
  • Ability to identify bottlenecks across the stack and drive improvements from first principles.
  • Comfort working in a fast-moving environment with ambiguous problems and high ownership.

Benefits

  • Competitive compensation, including equity.
  • Flexible work arrangements, with the option to work remotely or in person from our San Francisco office.
  • Visa sponsorship and relocation support for international candidates.
  • Quarterly team offsites, hackathons, conferences, and learning opportunities.
  • A deeply technical, high-agency team working on infrastructure for open superintelligence.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
systems engineering, AI/ML infrastructure, large-scale model training, inference, PyTorch, distributed training frameworks, DeepSpeed, FSDP, Megatron, vLLM
Soft Skills
collaboration, problem-solving, ownership, adaptability