Tech Stack
Cloud, Grafana, Kubernetes, Prometheus
About the role
- Lead and coordinate the development of bare-metal and virtualized GPU cluster offerings.
- Work closely with SRE, hardware, and cluster teams to deliver robust infrastructure.
- Drive the infrastructure and cluster roadmaps, aligning priorities across teams and ensuring clear delivery goals.
- Oversee tracking of server infrastructure, ensuring visibility and accountability for hardware usage and deployments.
- Align team efforts with company objectives, resolving priorities across multiple streams of work.
- Implement and improve processes for roadmapping, prioritization, and cross-team collaboration.
- Mentor and support engineers, fostering a culture of collaboration, delivery, and innovation.
- Contribute to long-term scalability by identifying and addressing systemic infrastructure challenges.
Requirements
- Proven experience as an Engineering Manager or Senior Technical Lead, ideally in a start-up or scale-up environment.
- Strong background in bare-metal server management and distributed computing/HPC.
- Experience with virtualization in large-scale environments.
- Strong leadership, organizational, and cross-functional communication skills.
- Excellent communication skills, both technical and non-technical.
- Nice-to-haves:
  - Experience with MaaS and infrastructure automation.
  - Experience with latest-generation GPU systems.
  - Experience with high-performance networking (RDMA, InfiniBand, RoCE).
  - Experience with HPC workload orchestration using Slurm and/or Kubernetes.
  - Experience with observability and monitoring stacks (Grafana, Prometheus, ELK).
  - Exposure to hardware lifecycle management and data center operations.