Crusoe

Senior Staff Software Engineer

full-time

Origin: 🇺🇸 United States • California

Salary

💰 $233,000 - $282,000 per year

Job Level

Senior

Tech Stack

Distributed Systems, Go, Grafana, HAProxy, Kubernetes, Linux, Node.js, Open Source, Prometheus, Python, Vault

About the role

  • Design, build, and operate Kubernetes clusters on bare metal at scale
  • Engineer full cluster lifecycle management (Talos bootstrapping, upgrades, node reprovisioning, HA control planes, recovery workflows)
  • Architect networking, load balancing, and service mesh solutions optimized for bare metal
  • Implement performant CNIs (Calico, Cilium), integrate L2/L3 networking and routing (BGP/ECMP), and optimize traffic across racks and datacenters
  • Automate provisioning via PXE/iPXE, Tinkerbell, and MAAS; manage BMCs through IPMI/Redfish; and standardize BIOS/firmware across heterogeneous hardware fleets
  • Design and operate persistent storage (local disks, block, object), including Ceph, Rook, and OpenEBS
  • Build automation and tooling (Go, Python, Bash) for provisioning, drift detection, upgrades, and incident response
  • Extend observability with Prometheus, Alertmanager, Grafana, OpenTelemetry, and define SLOs for cluster health, latency, and workload availability
  • Implement security best practices: Vault, cert-manager, RBAC hardening, network policies, and OS/K8s patch pipelines
  • Mentor engineers and shape technical direction for Crusoe’s Kubernetes platform

Requirements

  • 10+ years in infrastructure engineering, including 3+ years operating Kubernetes in production
  • Strong experience running Kubernetes on bare metal (not just managed services)
  • Expert-level knowledge of Linux internals (cgroups, namespaces, kernel networking)
  • Deep experience with CNIs (Cilium, Calico), load balancers (Envoy, HAProxy, F5), and L3 networking (BGP, ECMP)
  • Proven track record provisioning and operating physical servers at scale (PXE/iPXE, Tinkerbell, MAAS, BMC/IPMI automation)
  • Strong programming skills in Go for building operators, controllers, and automation tooling
  • Hands-on experience with distributed storage systems (Ceph, MinIO, Rook, CSI drivers)
  • Strong background in observability (Prometheus, Alertmanager, metrics autoscaling, logging/ELK)
  • Familiarity with PKI, identity, and secrets management (Vault, cert-manager)
  • Excellent debugging skills for complex distributed systems
  • Strong communication and collaboration across cross-functional teams
  • Bonus: Experience with hardware fleet management across multiple datacenters
  • Bonus: Contributions to open source Kubernetes or related ecosystem projects
  • Bonus: Experience implementing disaster recovery strategies at scale
  • Bonus: Familiarity with GPUs, HPC clusters, or large-scale AI/ML workloads