Salary
💰 $349,000 - $523,000 per year
Tech Stack
Cloud, Kubernetes
About the role
- Architect and define scalable compute platforms optimized for AI/ML, simulation, and high-throughput workloads
- Develop compute system standards and design patterns to ensure consistency, performance, and maintainability across infrastructure
- Evaluate emerging CPU, GPU, and accelerator technologies and own architectural tradeoff decisions affecting compute density, power, cooling, and total cost
- Collaborate with product and engineering teams to map workload requirements to compute platform capabilities across bare metal and cloud deployments
- Define compute platform roadmaps and architectural reference designs guiding hardware selection, firmware baselines, and rack-level configurations
- Act as a technical lead during new platform introductions, guiding validation and performance characterization efforts
- Mentor systems engineers and cross-functional stakeholders on compute performance tuning, sizing, and architectural decisions
Requirements
- 7+ years of proven experience architecting large-scale HPC or cloud compute platforms
- Deep knowledge of CPU/GPU architectures, including system-on-chip integration, memory hierarchies, and accelerator topologies
- Experience designing systems around high-bandwidth, low-latency fabrics (NVLink, InfiniBand)
- Strong understanding of system performance tuning, resource scheduling, thermal and power optimization, and compute lifecycle management
- Comfortable working across hardware and software boundaries, especially at the intersection of compute architecture, OS behavior, and orchestration layers
- Skilled at balancing architectural tradeoffs for density, power efficiency, cooling, and performance
- Strong analytical and communication skills, with a track record of influencing technical strategy across teams
- Willingness and ability to work onsite at Lambda's San Francisco office 4 days per week
- Nice to have: Hands-on experience with AI/ML workloads and their compute performance characteristics
- Nice to have: Familiarity with orchestration tools used in HPC (Slurm, Kubernetes)
- Nice to have: Experience with virtualization technologies, specifically GPU virtualization
- Nice to have: Exposure to hardware validation, vendor collaboration, and long-term OEM roadmap alignment
- Nice to have: Background in compute telemetry, real-time performance profiling, or large-scale A/B infrastructure testing