Salary
💰 $267,000 - $401,000 per year
Tech Stack
AWS, Cloud, Go, Grafana, Kubernetes, Prometheus, Python, Terraform
About the role
- Architect, deploy, and manage Kubernetes clusters across AWS, OCI, and on-prem datacenters
- Build and maintain automation for cluster lifecycle management, upgrades, and scaling
- Own the reliability, performance, and security of Kubernetes workloads
- Implement observability, logging, and alerting for clusters and critical workloads
- Partner with developers to design scalable, cloud-native services and CI/CD pipelines
- Define and enforce best practices for resource usage, networking, and RBAC
- Lead incident response, root cause analysis, and post-mortems for cluster-related issues
- Mentor junior engineers and contribute to internal platform engineering standards
Requirements
- 5+ years of experience in Platform, Infrastructure, or SRE roles
- Expert knowledge of Kubernetes internals and operational practices
- Proven experience running Kubernetes clusters in production at scale
- Strong skills with Helm, Kustomize, or similar deployment tooling
- Solid understanding of networking, service meshes, and container runtimes
- Proficiency in infrastructure-as-code (Terraform, Pulumi, etc.)
- Experience with observability stacks (Prometheus, Grafana, ELK, OpenTelemetry)
- Familiarity with security best practices (network policies, secrets management, image scanning)
- Strong coding skills in Go, Python, or similar for automation
- Comfort with GitOps workflows and CI/CD integration
- Excellent problem-solving skills and ability to operate in complex environments
- Willingness and ability to work onsite at the San Francisco, San Jose, or Seattle office 4 days per week
- Experience with multi-cluster, multi-cloud, or hybrid environments (nice to have)
- Knowledge of GPU scheduling, HPC workloads, or ML/AI infrastructure (nice to have)
- Exposure to cost optimization and capacity planning for large clusters (nice to have)
- Contributions to CNCF or Kubernetes open-source projects (nice to have)