Lambda

Senior Site Reliability Engineer – Managed Kubernetes

full-time

Origin: 🇺🇸 United States • California

Salary

💰 $267,000 - $401,000 per year

Job Level

Senior

Tech Stack

Cloud, Go, Grafana, Kubernetes, Linux, Prometheus, Python

About the role

  • Operate and maintain bare-metal Kubernetes clusters, scaling up to thousands of nodes
  • Handle cluster degradation, recovery, resizing, and incident response using fleet management tools
  • Participate in a well-managed on-call rotation for critical incidents
  • Assist customers with Kubernetes questions, workload integration, storage, and authentication
  • Work closely with HPC Ops and Datacenter Ops teams for low-level or cross-functional issues
  • Use Python and Go to build tooling and automate the validation of platform quality
  • Design, build, and maintain scalable control plane services, operators, and custom controllers for Kubernetes
  • Develop automation for cluster lifecycle management: provisioning, upgrades, patching, and deletion
  • Define and implement SLOs and SLIs for Kubernetes services, workloads, and platform reliability

Requirements

  • 6+ years of experience as an SRE, operations engineer, or in a similar role, with deep knowledge of running Linux clusters and systems
  • Strong programming skills in Go and Python; experience with GitOps (e.g., ArgoCD), Helm, and Kubernetes operators
  • Proven experience operating Kubernetes clusters in production environments (on-prem, EKS, GKE, or similar)
  • Can work either independently with limited direction or as part of a team
  • Can work with customers during incidents via tickets, live messaging, or as part of a larger call
  • Familiarity with observability tools such as Prometheus, Grafana, and Fluent Bit, as well as CI/CD pipelines
  • Proven experience provisioning Kubernetes using tools such as kubeadm, Cluster API, or similar
  • Deep Kubernetes expertise: CRDs, CSI, CNI, and Kubernetes operator development experience (nice-to-have)
  • Exposure to HPC clusters, AI/ML workloads, or large-scale GPU clusters (nice-to-have)
  • Hybrid or multi-cloud Kubernetes environment experience (nice-to-have)
  • Contributions to CNCF projects or Kubernetes SIGs (nice-to-have)