
Member of Technical Staff, Research Engineer – GPU Performance
Runway
Full-time
Location Type: Remote
Location: United States
Salary
💰 $270,000 - $370,000 per year
About the role
- Help world models train faster and run more efficiently.
- Profile, optimize, and rearchitect systems that turn research ideas into models that run at scale and in real time.
- Optimize training throughput across large GPU clusters.
- Design and maintain distributed training infrastructure.
- Profile and accelerate inference pipelines for real-time multimodal generation.
- Optimize and scale training infrastructure to improve efficiency and reliability.
- Contribute to the entire stack, from low-level kernel optimizations to high-level model design.
Requirements
- 4+ years of experience in systems engineering, ML infrastructure, or performance optimization for deep learning.
- Familiarity with GPU kernel development (CUDA, Triton, CUTLASS) and distributed systems (NCCL, collective communication, model parallelism).
- Experience with ML framework internals (PyTorch, JAX) and mixed-precision / low-precision techniques (FP8, INT8).
- Experience building and operating large-scale training infrastructure, including fault tolerance and cluster orchestration.
- Excitement about building AI that simulates the world — and making it performant enough to run in real time.
- Bonus if you have experience with PyTorch's compilation feature, torch.compile.
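For reference, a minimal sketch of the torch.compile feature mentioned above (this is an illustration, not Runway's code; the small model and shapes are invented for the example):

```python
import torch

# A toy model standing in for a real network.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 128),
    torch.nn.GELU(),
    torch.nn.Linear(128, 64),
)

# torch.compile captures the model's computation graph so the backend can
# fuse and optimize it. The "eager" backend is used here so the sketch runs
# without a GPU or a C++ toolchain; production use would rely on the
# default inductor backend for actual kernel fusion.
compiled = torch.compile(model, backend="eager")

x = torch.randn(8, 64)
y = compiled(x)           # first call triggers graph capture
print(tuple(y.shape))     # (8, 64)
```

The compiled module is a drop-in replacement for the original: same inputs, same outputs, with optimization applied under the hood.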
Benefits
- Salary range based on competitive market rates for our size, stage, and industry.
- Pay equity for our team.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
performance optimization, GPU kernel development, CUDA, Triton, CUTLASS, distributed systems, NCCL, PyTorch, JAX, mixed-precision techniques
Soft Skills
problem-solving, collaboration, communication, creativity, adaptability