Tech Stack
Ansible, Cloud, Kubernetes, Linux, Terraform
About the role
- Design, deploy, and manage the compute infrastructure powering Fluidstack's GPU clusters
- Design and implement GPU/ASIC infrastructure at the server, rack, and system level
- Troubleshoot complex GPU and compute system failures
- Develop and maintain hardware/firmware management services
- Automate all aspects of the server lifecycle
- Own end-to-end compute lifecycle, including partnering with vendors on RMAs
- Serve as the main point of contact for hardware escalation and troubleshooting
- Monitor system performance, identifying and resolving bottlenecks
- Automate deployment and management tasks to improve efficiency
- Collaborate with storage and network teams to ensure cohesive infrastructure operations
- Work closely with hardware and software teams to support AI workloads
Requirements
- 5+ years of experience in compute infrastructure engineering
- Strong knowledge of Linux systems administration and performance tuning
- Experience with bare-metal provisioning tools (MAAS, Metal3, Tinkerbell, or similar)
- Familiarity with GPU hardware and workload optimization, especially kernel- and driver-level requirements
- Proficiency in automation tools (e.g., Ansible, Terraform)
- Experience operating Kubernetes and Slurm clusters