Salary
💰 $166,000 - $204,000 per year
Tech Stack
AWS, Azure, Cloud, Distributed Systems, Go, Google Cloud Platform, Kubernetes, Linux, Node.js, Python, Rust, Terraform
About the role
- Architect, build, and operate features for Crusoe’s Managed Kubernetes platform (control plane, autoscaling, cluster lifecycle, upgrades, multi-tenancy).
- Integrate and optimize GPU workloads within Kubernetes clusters, including device plugins, GPU operators, scheduling, and monitoring.
- Enhance container networking through advanced CNI integration (Cilium, Calico, Multus) and support for high-performance networking (InfiniBand, RoCE).
- Improve reliability and resilience of Kubernetes clusters, including HA control planes, node lifecycle management, and self-healing mechanisms.
- Contribute to open-source and internal tooling that enhances observability, automation, and cluster security.
- Participate in design reviews, provide mentorship to engineers, and help set long-term technical direction.
- Troubleshoot complex distributed systems problems spanning containers, GPUs, and networking.
Requirements
- 5+ years of software engineering experience in distributed systems, cloud, or infrastructure.
- Deep understanding of Kubernetes internals (control plane, scheduling, operators, controllers, API machinery).
- Strong proficiency in Go (preferred) or similar languages (Rust, C++, Python for systems work).
- Experience with container networking (CNI plugins, service mesh, load balancing) and Linux networking fundamentals.
- Exposure to GPU workloads in Kubernetes (device plugins, GPU operators, scheduling, autoscaling).
- Familiarity with cloud platforms (AWS, GCP, or Azure) and infrastructure automation (Terraform, Helm, GitOps).
- Strong debugging and performance optimization skills for distributed systems.
- Passion for building reliable, developer-friendly platforms that abstract complexity for customers.
- Familiarity with NVIDIA and AMD GPUs, device plugins, and operators for GPU lifecycle management.
- Knowledge of network operators and CNI implementations (Cilium, Calico, Multus).
- Experience with high-performance networking technologies (InfiniBand, RoCE).
- Contributions to Kubernetes SIGs, CNCF projects, or related open-source communities.
- Experience with Slurm, MPI, or HPC-style job schedulers.
- Familiarity with service meshes (Istio, Linkerd) and multi-cluster networking.
- Background in security for containers, GPUs, and Kubernetes (PodSecurity, RBAC, runtime scanning).