DevOps Engineer

• Operate and maintain bare-metal Kubernetes clusters that scale to thousands of nodes
• Handle cluster degradation, recovery, resizing, and incident response using fleet management tools
• Participate in a well-managed on-call rotation for critical incidents
• Assist customers with Kubernetes questions, workload integration, storage, and authentication
• Work closely with HPC Ops and Datacenter Ops teams for low-level or cross-functional issues
• Use Python and Go to build tooling and automate validation of platform quality
• Design, build, and maintain scalable control plane services, operators, and custom controllers for Kubernetes
• Develop automation for cluster lifecycle management: provisioning, upgrades, patching, and deletion
• Define and implement SLOs and SLIs for Kubernetes services, workloads, and platform reliability
