
Software Engineer – GPU Networking, Distributed Systems
Baseten
full-time
Location Type: Hybrid
Location: San Francisco • California • United States
Salary
💰 $150,000 - $250,000 per year
About the role
- Make RDMA First-Class: Integrate RDMA/RoCE/InfiniBand capabilities into our inference stack.
- Optimize Distributed Inference: Implement and tune networking layers for Disaggregated KV Cache Offload and WideEP.
- Enable Serverless-Grade Startup Speeds for LLMs: Work on checkpointing and storage to achieve sub-10-second model startup.
- Deep-Dive into Hardware: Validate networking performance on bleeding-edge clusters and write acceptance tests.
- Build Observability: Design tools to visualize packet flow and diagnose distributed system behaviors.
- Optimize Kernels: Work with communication libraries (NCCL, NVSHMEM) and write custom kernels to overlap compute and data transfer.
Requirements
- Deep experience with high-performance networking protocols (InfiniBand, RoCE v2) and an understanding of the physics of data movement.
- Fluent in C++ or Python, with the ability to bridge the gap between high-level logic and hardware.
- Deep understanding of the memory hierarchy in modern NVIDIA architectures (H100/Blackwell) and the ability to optimize for it.
- Experience with NCCL, NVSHMEM, and UCX is highly preferred.
- Experience with GPUDirect Storage (GDS) or high-performance filesystems like Weka or 3FS.
- Familiarity with TensorRT-LLM, vLLM, or SGLang is a plus.
- Experience running low-level benchmarks to "qualify" new hardware clusters.
Benefits
- Competitive compensation, including meaningful equity.
- 100% coverage of medical, dental, and vision insurance for employees and dependents
- Generous PTO policy, including a company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!)
- Paid parental leave
- Company-facilitated 401(k)
- Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.
Applicant Tracking System Keywords
Hard Skills & Tools
C++, Python, RDMA, RoCE, InfiniBand, NCCL, NVSHMEM, UCX, GPUDirect Storage, TensorRT-LLM