Design, build, and operate foundational systems that power our products.
Blend deep infrastructure expertise with software engineering discipline to create scalable, resilient, and developer‑friendly platforms.
Partner closely with engineering teams to evolve our platform architecture, improve reliability, and accelerate delivery through automation, observability, and thoughtful system design.
Build and evolve multiple distributions of our Kubernetes platform.
Build automation and tooling to streamline deployments, configuration, and environment management.
Drive reliability practices such as SLOs, error budgets, incident response, and post‑incident reviews.
Develop golden paths for service onboarding, CI/CD, and platform usage across all Kubernetes distributions.
Implement observability systems including metrics, logging, tracing, and alerting.
Collaborate with product and engineering teams to ensure platform capabilities meet evolving needs.
Optimize performance and capacity across compute, storage, and networking layers.
Champion infrastructure-as-code and modern cloud‑native patterns.
Drive automation-first operations using IaC and GitOps.
Lead incident response, root cause analysis (RCA), and post-incident learning, and improve on-call health.
Partner with security teams to enforce platform guardrails, policy, and secure defaults.
Lead complex troubleshooting efforts across distributed systems and production environments.
Mentor engineers and contribute to a culture of operational excellence.

Requirements

8+ years of experience in SRE, DevOps, or platform engineering with hands‑on ownership of production systems.
Expertise in Kubernetes and container orchestration at scale.
Proficiency with IaC tools such as Terraform, Ansible, and CloudFormation.
Solid programming skills in languages such as Go, Python, or Bash.
Deep understanding of distributed systems, networking, and Linux internals.
Experience building CI/CD pipelines using tools like GitLab CI/CD, GitHub Actions, or Jenkins.
Strong observability background with Prometheus, Grafana, OpenTelemetry, or similar.
Proven track record of incident management and improving system reliability.

Benefits
