Lambda

Data Center Operations Engineer

full-time

Origin: 🇺🇸 United States • California


Salary

💰 $250,000 - $300,000 per year

Job Level

Senior, Lead

Tech Stack

Cloud

About the role

  • Availability Analysis: Own end-to-end unification of availability calculations across Lambda's data center products and footprints, from power, BMS, and cooling down to the rack and GPU level, and provide adequate telemetry back to facilities, site operations, and the platform level
  • Utilization Analysis and Oversubscription Strategy: Own end-to-end utilization analysis across Lambda's entire data center infrastructure; analyze DC designs to understand peak possible capacity under varying conditions; build the oversubscription strategy and lead the company workstream to maximize available MW without impacting GPU reliability or customer experience; ensure appropriate availability considerations are included
  • Observability and Analytics: Coordinate with the observability team to ensure the appropriate points are monitored to understand data center load characteristics, especially under AI workloads; help the team understand where approximate warning and danger levels lie; use observations and those levels to inform the BOD for future data centers and suggest upgrades or modifications to current data centers; develop a strategy for a data center fleet health dashboard; help provide structure so that overall day-to-day and long-term health can be understood from a 20,000-foot level, with the ability to drill down into the details
  • Power Capping Strategy and Implementation: Coordinate with the Site Operations team to strategize and build out power capping capabilities for worst-case scenario response and protection as we start aggressively employing oversubscription; identify appropriate IT blocks where real-time data is monitored; analyze, propose, and implement a rigorous testing process that iteratively finds and eliminates stranded power and cooling capacity related to utilization
  • Site Selection Technical Review: Conduct end-to-end technical evaluations of prospective data center sites, including power sufficiency and stability, cooling infrastructure and mechanical systems, and network topology feasibility; perform risk assessments and recommend sites based on infrastructure fit and growth capacity; coordinate with DC Infra, Legal, and Business Strategy teams to ensure site selections align with workload and deployment timelines
  • Cluster-to-Facility Requirements Alignment: Collaborate with the HPC Architecture team and the Capacity Manager to translate cluster-level hardware and workload requirements into facility-level specifications; define infrastructure interface requirements (power, cooling, rack layouts, interconnects, monitoring) to ensure alignment between the compute stack and facility capabilities; support long-term infrastructure roadmap development to accommodate future hardware designs, density shifts, and workload patterns; work with the Capacity Manager to understand the various levers that can be employed to accelerate growth during demand surges
  • You: Self-starter with a proven ability to independently dive into the details to understand and solve hard problems across data center infrastructure and operations
  • You: Ability to provide world-class analysis, boiling complex issues down to the root cause or a few key drivers
  • You: 10+ years of experience working directly in or closely with data center infrastructure and HPC/HW operations
  • You: Deep familiarity with AI or compute workload patterns, scaling dynamics, and infrastructure cost drivers
  • You: Ability to synthesize complex technical and business inputs into clear, actionable strategic recommendations
  • You: Excellent communication and collaboration skills across technical, operational, and financial stakeholders
  • Preferred Experience: Prior experience in hyperscale or cloud infrastructure environments
  • Preferred Experience: Familiarity with GPU cluster sizing, workload forecasting, or energy-efficient compute architectures
  • Preferred Experience: Working knowledge of typical Data Center Infrastructure designs, topologies, systems and associated reliability/availability calculations
  • Preferred Experience: Knowledge of DCIM tools, telemetry systems, or utilization analytics platforms
  • Preferred Experience: Engineering degree from a university; Master's preferred
  • Preferred Experience: Experience working across multi-disciplinary and non-technical teams to explain findings
