Salary
💰 $166,000 - $201,000 per year
Tech Stack
Distributed Systems, Go, Grafana, HAProxy, Kubernetes, Linux, Node.js, Open Source, Prometheus, Python, Vault
About the role
- Design, build, and operate Kubernetes clusters on bare metal at scale
- Engineer full cluster lifecycle management (Talos bootstrapping, upgrades, node reprovisioning, HA control planes, recovery workflows)
- Architect networking, load balancing, and service mesh solutions optimized for bare metal
- Implement performant CNIs (Calico, Cilium), integrate L2/L3 networking and routing (BGP/ECMP), and optimize traffic across racks and datacenters
- Automate provisioning via PXE/iPXE, Tinkerbell, and MAAS; manage BMCs via IPMI/Redfish and standardize BIOS/firmware across heterogeneous hardware fleets
- Design and operate persistent storage (local disks, block, object) including Ceph, Rook, and OpenEBS
- Build automation and tooling (Go, Python, Bash) for provisioning, drift detection, upgrades, and incident response
- Extend observability with Prometheus, Alertmanager, Grafana, OpenTelemetry, and define SLOs for cluster health, latency, and workload availability
- Implement security best practices: Vault, cert-manager, RBAC hardening, network policies, and OS/K8s patch pipelines
- Mentor engineers and shape technical direction for Crusoe’s Kubernetes platform
Requirements
- 5+ years in infrastructure engineering, including 3+ years operating Kubernetes in production
- Strong experience running Kubernetes on bare metal (not just managed services)
- Expert-level knowledge of Linux internals (cgroups, namespaces, kernel networking)
- Deep experience with CNIs (Cilium, Calico), load balancers (Envoy, HAProxy, F5), and L3 networking (BGP, ECMP)
- Proven track record provisioning and operating physical servers at scale (PXE/iPXE, Tinkerbell, MAAS, BMC/IPMI automation)
- Strong programming skills in Go for building operators, controllers, and automation tooling
- Hands-on experience with distributed storage systems (Ceph, MinIO, Rook, CSI drivers)
- Strong background in observability (Prometheus, Alertmanager, metrics autoscaling, logging/ELK)
- Familiarity with PKI, identity, and secrets management (Vault, cert-manager)
- Excellent debugging skills for complex distributed systems
- Strong communication and collaboration across cross-functional teams
- Bonus: Experience with hardware fleet management across multiple datacenters
- Bonus: Contributions to open source Kubernetes or related ecosystem projects
- Bonus: Experience implementing disaster recovery strategies at scale
- Bonus: Familiarity with GPUs, HPC clusters, or large-scale AI/ML workloads