Salary
💰 $349,000 - $523,000 per year
About the role
- Architect high-performance networking solutions that power cloud platforms, with a focus on ultra-low-latency and high-bandwidth connectivity
- Define the network topology and architectural patterns for large-scale GPU clusters, storage backends, and multi-tenant environments
- Evaluate, benchmark, and select next-generation network technologies (e.g., InfiniBand NDR/XDR, RoCE, 400G/800G Ethernet) to meet AI workload requirements
- Develop and maintain network architecture standards, reference designs, and scalability roadmaps for multi-site and hybrid environments
- Partner with compute and storage architects to ensure seamless end-to-end data flow and fault tolerance
- Guide network automation strategies and tooling to enable efficient provisioning, telemetry, and operational visibility
- Mentor engineers and cross-functional teams on advanced network concepts, troubleshooting, and architectural best practices
Requirements
- Proven experience (7+ years) architecting high-performance data center networks, preferably for HPC, AI/ML, or large-scale cloud infrastructure
- Deep expertise with InfiniBand (HDR/NDR) and advanced Ethernet fabrics, including RoCE and other RDMA protocols
- Strong understanding of data center switching architectures, congestion control, QoS, and network virtualization (VXLAN, EVPN)
- Skilled in designing for low-latency and high-throughput data paths, including GPU-to-GPU and storage traffic optimization
- Proficient with routing/switching protocols (BGP, OSPF) and software-defined networking (SDN) concepts
- Experience building resilient, fault-tolerant network architectures with redundancy, failover, and high availability
- Excellent communication and leadership skills, capable of influencing technical decisions across diverse teams
- Willing and able to work onsite at our San Francisco office 4 days per week (Lambda’s designated work-from-home day is Tuesday)
- Nice to have: Hands-on experience with AI workload profiling, collective communication patterns (e.g., NCCL, MPI), and network tuning for distributed training
- Nice to have: Familiarity with network automation frameworks and telemetry tools
- Nice to have: Exposure to DPU/SmartNIC technologies such as NVIDIA BlueField
- Nice to have: Knowledge of large-scale, multi-site interconnect design, including DWDM or metro/long-haul networking
- Nice to have: Experience collaborating with hyperscale or enterprise customers on highly customized network designs