Define the infrastructure platform, including end-to-end hardware lifecycle, DCIM, and building controls.
Work directly with engineering and infrastructure teams, and collaborate closely with datacenter design and operations teams, to build tools that enable hundreds of thousands of GPUs of capacity and reliable multi-year operation.
Own the processes and tools to manage the lifecycle of all hardware within a datacenter environment, from shipment through to retirement.
Partner with infrastructure and engineering teams to translate datacenter design and operational requirements into technical specifications, including the DCIM system, network automation, and zero-touch provisioning (ZTP) of compute infrastructure.
Work with external datacenter and operations partners to integrate their systems into our tooling.
Collaborate with hardware vendors on naming schemes, asset management, factory integration, RMA process, and other stages of the hardware lifecycle.
Requirements
3-5 years of experience building developer tools or cloud infrastructure, ideally building DCIM tools or managing the lifecycle of compute and networking infrastructure.
Strong understanding of AI/ML workloads and infrastructure, including GPU acceleration, model training and inference pipelines, and modern datacenter architecture.
Familiarity with DCIM tools like NetBox as well as bare metal provisioning and management tools (e.g. MAAS, Tinkerbell, Metal3).
Familiarity with industrial protocols like Modbus for telemetry/management of CDUs/UPSes/CRACs/ATSes/etc.
Knowledge of Infrastructure-as-Code (IaC) tools (e.g. Terraform, Pulumi).
Understanding of SLA and SLO frameworks and error budget management, as well as the ability to build new Grafana dashboards to track metrics that matter.
Excellent communication and cross-functional leadership skills.
Comfortable designing and working with APIs.
Strong product intuition and taste in developer experience and tooling.