Salary
💰 $267,000 - $401,000 per year
Tech Stack
Cloud, Go, Grafana, Kubernetes, Linux, Prometheus, Python
About the role
- Operate and maintain bare-metal Kubernetes clusters, scaling up to thousands of nodes
- Handle cluster degradation, recovery, resizing, and incident response using fleet management tools
- Participate in a well-managed on-call rotation for critical incidents
- Assist customers with Kubernetes questions, workload integration, storage, and authentication
- Work closely with HPC Ops and Datacenter Ops teams for low-level or cross-functional issues
- Use Python and Go to create tooling and automate the validation of platform quality
- Design, build, and maintain scalable control plane services, operators, and custom controllers for Kubernetes
- Develop automation for cluster lifecycle management: provisioning, upgrades, patching, and deletion
- Define and implement SLOs and SLIs for Kubernetes services, workloads, and platform reliability
Requirements
- 6+ years of experience as an SRE, operations engineer, or in a similar role, with deep knowledge of running Linux clusters and systems
- Strong programming skills in Go and Python; experience with GitOps (e.g., ArgoCD), Helm, and Kubernetes operators
- Proven experience operating Kubernetes clusters in production environments (on-prem, EKS, GKE, or similar)
- Can work either independently with limited direction or as part of a team
- Can work with customers during incidents via tickets, live messaging, or as part of a larger call
- Familiarity with observability tools such as Prometheus, Grafana, and Fluent Bit, as well as CI/CD pipelines
- Proven experience provisioning Kubernetes using tools such as kubeadm, Cluster API, or similar
- Deep Kubernetes expertise: CRDs, CSI, CNI, and experience writing Kubernetes operators (nice-to-have)
- Exposure to HPC clusters, AI/ML workloads, or large-scale GPU clusters (nice-to-have)
- Hybrid or multi-cloud Kubernetes environment experience (nice-to-have)
- Contributions to CNCF projects or Kubernetes SIGs (nice-to-have)