Salary
💰 $250,000 - $300,000 per year
About the role
- Availability Analysis: Own end-to-end unification of availability calculations across Lambda's data center products and various data center footprints, from power/BMS/cooling down to the rack/GPU level, and provide adequate telemetry back to facilities, site operations, and the platform level
- Utilization Analysis and Oversubscription Strategy: Own end-to-end utilization analysis across Lambda's entire data center infrastructure; analyze DC designs to understand peak possible capacity under varying conditions; build the oversubscription strategy and lead the company workstream to maximize available MW without impacting GPU reliability or customer experience; ensure appropriate availability considerations are included
- Observability and Analytics: Coordinate with the observability team to ensure the appropriate points are monitored to understand characteristic data center loads, especially under AI workloads; help the team understand where approximate warning/danger levels lie; use observations and warning/danger levels to inform the BOD for future data centers and suggest upgrades/modifications to current data centers; develop the strategy for a data center fleet health dashboard; help provide structure so that overall day-to-day and long-term health can be understood from a 20,000-foot level, with the ability to drill down into the details
- Power Capping Strategy and Implementation: Coordinate with the Site Operations team to strategize and build out power capping capabilities for worst-case scenario response/protection as we start aggressively employing oversubscription; identify appropriate IT blocks where real-time data is monitored; analyze, propose, and implement a rigorous testing process that iteratively finds and eliminates stranded power and cooling capacity related to utilization
- Site Selection Technical Review: Conduct end-to-end technical evaluations of prospective data center sites, including power sufficiency and stability, cooling infrastructure and mechanical systems, and network topology feasibility; perform risk assessments and recommend sites based on infrastructure fit and growth capacity; coordinate with DC Infra, Legal, and Business Strategy teams to ensure site selections align with workload and deployment timelines
- Cluster-to-Facility Requirements Alignment: Collaborate with the HPC Architecture team and the Capacity Manager to translate cluster-level hardware and workload requirements into facility-level specifications; define infrastructure interface requirements (power, cooling, rack layouts, interconnects, monitoring) to ensure alignment between the compute stack and facility capabilities; support long-term infrastructure roadmap development to accommodate future hardware designs, density shifts, and workload patterns; work with the Capacity Manager to understand the various levers that can be employed to accelerate growth during demand surges
- You: Self-starter with a proven ability to independently dive into the details to understand and solve hard problems across data center infrastructure and operations
- You: Ability to provide world-class analysis, boiling complex issues down to the root cause or a few key drivers
- You: 10+ years of experience working directly in or closely with data center infrastructure and HPC/HW operations
- You: Deep familiarity with AI or compute workload patterns, scaling dynamics, and infrastructure cost drivers
- You: Ability to synthesize complex technical and business inputs into clear, actionable strategic recommendations
- You: Excellent communication and collaboration skills across technical, operational, and financial stakeholders
- Preferred Experience: Prior experience in hyperscale or cloud infrastructure environments
- Preferred Experience: Familiarity with GPU cluster sizing, workload forecasting, or energy-efficient compute architectures
- Preferred Experience: Working knowledge of typical data center infrastructure designs, topologies, and systems, and the associated reliability/availability calculations
- Preferred Experience: Knowledge of DCIM tools, telemetry systems, or utilization analytics platforms
- Preferred Experience: University engineering degree; Master's preferred
- Preferred Experience: Experience working across multi-disciplinary and non-technical teams to explain findings