Data Center Operations Engineer

Lambda

full-time

Posted on: 8/30/2025

Location: California • 🇺🇸 United States

Salary

💰 $250,000 - $300,000 per year

Job Level

Senior, Lead

Tech Stack

Cloud

About the role

Availability Analysis: Own end-to-end unification of availability calculations across Lambda's data center products and data center footprints, from power, BMS, and cooling down to the rack and GPU level; provide adequate telemetry back to facilities, site operations, and the platform level
Utilization Analysis and Oversubscription Strategy: Own end-to-end utilization analysis across Lambda's entire data center infrastructure; analyze DC designs to understand peak possible capacity under varying conditions; build the oversubscription strategy and lead the company workstream to maximize available MW without impacting GPU reliability or customer experience; ensure appropriate availability considerations are included
Observability and Analytics: Coordinate with the observability team to ensure the appropriate points are monitored to understand data center load characteristics, especially under AI workloads; help the team understand where approximate warning and danger levels are; use observations and warning/danger levels to inform the BOD for future data centers and suggest upgrades or modifications to current data centers; develop a strategy for a data center fleet health dashboard; help provide structure so that overall day-to-day and long-term health can be understood from a 20,000-foot level, with the ability to drill down into the details
Power Capping Strategy and Implementation: Coordinate with the Site Operations team to strategize and build out power capping capabilities for worst-case scenario response and protection as we start aggressively employing oversubscription; identify appropriate IT blocks where real-time data is monitored; analyze, propose, and implement a rigorous testing process that iteratively finds and eliminates stranded power and cooling capacity related to utilization
Site Selection Technical Review: Conduct end-to-end technical evaluations of prospective data center sites, including power sufficiency and stability, cooling infrastructure and mechanical systems, and network topology feasibility; perform risk assessments and recommend sites based on infrastructure fit and growth capacity; coordinate with DC Infra, Legal, and Business Strategy teams to ensure site selections align with workload and deployment timelines
Cluster-to-Facility Requirements Alignment: Collaborate with the HPC Architecture team and Capacity Manager to translate cluster-level hardware and workload requirements into facility-level specifications; define infrastructure interface requirements (power, cooling, rack layouts, interconnects, monitoring) to ensure alignment between the compute stack and facility capabilities; support long-term infrastructure roadmap development to accommodate future hardware designs, density shifts, and workload patterns; work with the Capacity Manager to understand the levers that can be employed to accelerate growth during demand surges
You: Self-starter with a proven ability to independently dive into the details to understand and solve hard problems across data center infrastructure and operations
You: Ability to provide world-class analysis, boiling complex issues down to the root cause or a few key drivers
You: 10+ years of experience working directly in or closely with data center infrastructure and HPC/HW operations
You: Deep familiarity with AI or compute workload patterns, scaling dynamics, and infrastructure cost drivers
You: Ability to synthesize complex technical and business inputs into clear, actionable strategic recommendations
You: Excellent communication and collaboration skills across technical, operational, and financial stakeholders
Preferred Experience: Prior experience in hyperscale or cloud infrastructure environments
Preferred Experience: Familiarity with GPU cluster sizing, workload forecasting, or energy-efficient compute architectures
Preferred Experience: Working knowledge of typical data center infrastructure designs, topologies, and systems, and the associated reliability/availability calculations
Preferred Experience: Knowledge of DCIM tools, telemetry systems, or utilization analytics platforms
Preferred Experience: Engineering degree from a university; Master's preferred
Preferred Experience: Experience working across multi-disciplinary and non-technical teams to explain findings