Staff Compute Architect, HPC

Lambda

full-time

Posted on: 9/4/2025

Location: California • 🇺🇸 United States

💰 $349,000 - $523,000 per year

Lead

Cloud, Kubernetes

About the role

Architect and define scalable compute platforms optimized for AI/ML, simulation, and high-throughput workloads
Develop compute system standards and design patterns to ensure consistency, performance, and maintainability across infrastructure
Evaluate emerging CPU, GPU, and accelerator technologies and own architectural tradeoff decisions affecting compute density, power, cooling, and total cost
Collaborate with product and engineering teams to map workload requirements to compute platform capabilities across bare metal and cloud deployments
Define compute platform roadmaps and architectural reference designs guiding hardware selection, firmware baselines, and rack-level configurations
Act as a technical lead during new platform introductions, guiding validation and performance characterization efforts
Mentor systems engineers and cross-functional stakeholders on compute performance tuning, sizing, and architectural decisions

Requirements

Proven experience (7+ years) architecting large-scale HPC or cloud compute platforms
Deep knowledge of CPU/GPU architectures, including system-on-chip integration, memory hierarchies, and accelerator topologies
Experience designing systems around high-bandwidth, low-latency fabrics (NVLink, InfiniBand)
Strong understanding of system performance tuning, resource scheduling, thermal and power optimization, and compute lifecycle management
Comfortable working across hardware and software boundaries, especially at the intersection of compute architecture, OS behavior, and orchestration layers
Skilled at balancing architectural tradeoffs for density, power efficiency, cooling, and performance
Strong analytical and communication skills, with a track record of influencing technical strategy across teams
Willingness and ability to work onsite at Lambda's San Francisco office 4 days per week
Nice to have: Hands-on experience with AI/ML workloads and their compute performance characteristics
Nice to have: Familiarity with orchestration tools used in HPC (Slurm, Kubernetes)
Nice to have: Experience with virtualization technologies, specifically GPU virtualization
Nice to have: Exposure to hardware validation, vendor collaboration, and long-term OEM roadmap alignment
Nice to have: Background in compute telemetry, real-time performance profiling, or large-scale A/B infrastructure testing