Senior Software Engineer, Managed Orchestration, Kubernetes

Crusoe

full-time

Posted on: 9/4/2025

Location: California • 🇺🇸 United States

💰 $166,000 - $204,000 per year

Senior

AWS, Azure, Cloud, Distributed Systems, Go, Google Cloud Platform, Kubernetes, Linux, Node.js, Python, Rust, Terraform

About the role

Architect, build, and operate features for Crusoe’s Managed Kubernetes platform (control plane, autoscaling, cluster lifecycle, upgrades, multi-tenancy).
Integrate and optimize GPU workloads within Kubernetes clusters, including device plugins, GPU operators, scheduling, and monitoring.
Enhance container networking through advanced CNI integration (Cilium, Calico, Multus) and support for high-performance networking (InfiniBand, RoCE).
Improve reliability and resilience of Kubernetes clusters, including HA control planes, node lifecycle management, and self-healing mechanisms.
Contribute to open-source and internal tooling that enhances observability, automation, and cluster security.
Participate in design reviews, provide mentorship to engineers, and help set long-term technical direction.
Troubleshoot complex distributed systems problems spanning containers, GPUs, and networking.

Requirements

5+ years of software engineering experience in distributed systems, cloud, or infrastructure.
Deep understanding of Kubernetes internals (control plane, scheduling, operators, controllers, API machinery).
Strong proficiency in Go (preferred) or similar languages (Rust, C++, Python for systems work).
Experience with container networking (CNI plugins, service mesh, load balancing) and Linux networking fundamentals.
Exposure to GPU workloads in Kubernetes (device plugins, GPU operators, scheduling, autoscaling).
Familiarity with cloud platforms (AWS, GCP, or Azure) and infrastructure automation (Terraform, Helm, GitOps).
Strong debugging and performance optimization skills for distributed systems.
Passion for building reliable, developer-friendly platforms that abstract complexity for customers.

Preferred qualifications

Familiarity with NVIDIA and AMD GPUs, device plugins, and operators for GPU lifecycle management.
Knowledge of network operators and CNI implementations (Cilium, Calico, Multus).
Experience with high-performance networking technologies (InfiniBand, RoCE).
Contributions to Kubernetes SIGs, CNCF projects, or related open-source communities.
Experience with Slurm, MPI, or HPC-style job schedulers.
Familiarity with service meshes (Istio, Linkerd) and multi-cluster networking.
Background in security for containers, GPUs, and Kubernetes (PodSecurity, RBAC, runtime scanning).