
Research Engineer – RL Infrastructure
Prime Intellect
Full-time
Location Type: Remote
Location: California • United States
About the role
- Build and optimize the systems infrastructure behind large-scale RL and distributed training workloads.
- Improve end-to-end training efficiency across compute, memory, networking, and scheduling layers.
- Design and implement low-level performance optimizations, including kernels, communication paths, and runtime improvements.
- Work on distributed training systems spanning data-, tensor-, and pipeline-parallel workloads.
- Help shape the architecture of our RL training stack, including async rollout and post-training systems.
- Contribute to open-source libraries and internal infrastructure used for frontier-scale model training.
- Collaborate closely with researchers and infrastructure engineers to translate bottlenecks into concrete systems improvements.
- Stay at the frontier of training systems, inference systems, compiler/runtime tooling, and hardware-aware optimization techniques.
Requirements
- Strong systems engineering experience in AI/ML infrastructure, especially around large-scale model training or inference.
- Deep familiarity with PyTorch and with distributed training and inference tooling such as PyTorch Distributed, FSDP, DeepSpeed, Megatron, vLLM, Ray, or related tooling.
- Experience optimizing training performance across kernels, memory movement, communication overhead, or parallelization strategy.
- Hands-on experience with large-scale training techniques including data parallelism, tensor parallelism, and pipeline parallelism.
- Strong understanding of GPU architecture, profiling, and performance debugging.
- Ability to identify bottlenecks across the stack and drive improvements from first principles.
- Comfort working in a fast-moving environment with ambiguous problems and high ownership.
Benefits
- Competitive compensation, including equity.
- Flexible work arrangements, with the option to work remotely or in person from our San Francisco office.
- Visa sponsorship and relocation support for international candidates.
- Quarterly team offsites, hackathons, conferences, and learning opportunities.
- A deeply technical, high-agency team working on infrastructure for open superintelligence.
Applicant Tracking System Keywords
Hard Skills & Tools
systems engineering, AI/ML infrastructure, large-scale model training, inference, PyTorch, distributed training frameworks, DeepSpeed, FSDP, Megatron, vLLM
Soft Skills
collaboration, problem-solving, ownership, adaptability