Salary
💰 $208,000 - $333,500 per year
Tech Stack
Ansible, AWS, Azure, Chef, Cloud, Distributed Systems, DNS, Go, Google Cloud Platform, Grafana, Kubernetes, Linux, Microservices, Prometheus, Puppet, Python, Splunk, TCP/IP, Terraform
About the role
- Support large-scale Kubernetes services before they launch through system design consulting, development of software tools, platforms, and frameworks, capacity management, and launch reviews
- Build, implement, and support the operational and reliability aspects of large-scale Kubernetes clusters, with a focus on performance at scale, real-time monitoring, logging, and alerting
- Define SLOs/SLIs, monitor error budgets, and streamline reporting
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health
- Operate and optimize GPU workloads across AWS, GCP, Azure, OCI, and private clouds
- Scale systems sustainably through mechanisms like automation, and evolve them by pushing for changes that improve reliability and velocity
- Lead triage and root-cause analysis of high-severity incidents
- Practice balanced incident response and blameless postmortems
- Participate in an on-call rotation to support production services
Requirements
- BS in Computer Science or related technical field, or equivalent experience
- 12+ years of experience operating production services at scale
- Expert-level knowledge of Kubernetes administration, containerization, and microservices architecture, with deep experience in Kubernetes operators and distributed systems at scale
- Experience with infrastructure automation tools (Terraform, Ansible, Chef, Puppet)
- Proficiency in at least one high-level programming language (e.g., Python, Go)
- In-depth knowledge of Linux operating systems, networking fundamentals (TCP/IP), and cloud security standards
- Demonstrated ability to troubleshoot complex DNS, network, Kubernetes, and systems issues in production environments
- Solid knowledge of SRE principles, including SLOs, SLIs, error budgets, and incident handling
- Experience building and operating comprehensive observability stacks (monitoring, logging, tracing) using tools such as OpenTelemetry, Prometheus, Grafana, ELK Stack, Lightstep, Splunk, and Datadog
- Ways to stand out from the crowd: operating GPU-accelerated clusters with KubeVirt in production; applying generative-AI techniques to reduce operational toil; automating incident response with Shoreline or StackStorm