Akka (formerly Lightbend)

Lead Site Reliability Engineer

Lead SRE responsible for the reliability, scalability, and security of Akka's cloud-native, multi-tenant PaaS platform, with operations spanning AWS, GCP, and Azure environments.

Posted 5/26/2026 • Full-time • Remote • 🇺🇸 United States • Senior

Tech Stack

Tools & technologies
AWS, Azure, Cloud, DNS, Flux, Go, Google Cloud Platform, Grafana, Kubernetes, Prometheus, Rust, Shell Scripting

About the role

Key responsibilities & impact
  • Own Service Level Objectives/Service Level Indicators (SLOs/SLIs) and error budgets across multi-cloud clusters (EKS, GKE, AKS); drive blameless post-mortems and systemic remediation.
  • Lead capacity planning with our customers, cluster lifecycle management, and Kubernetes and database upgrade cycles.
  • Define and enforce runbooks, on-call rotations, and escalation paths for the wider engineering organisation.
  • Own and evolve the IaC layer: Helm charts, Crossplane compositions, and FluxCD GitOps pipelines.
  • Design and maintain cloud-resource provisioning workflows that span all three cloud providers, with consistent policy controls.
  • Architect and operate connectivity patterns: AWS PrivateLink / Transit Gateway, GCP NCC, Azure VNet Peering, and cross-region ingress with Contour/Envoy.
  • Maintain and evolve the Linkerd service mesh for mTLS, workload identity (OIDC), and zero-trust authorisation policies.
  • Drive PKI hygiene with cert-manager: root/intermediate CA rotation, ACME certificate lifecycle, and secret management via KMS-backed Kubernetes vaults.
  • Own the observability stack: Prometheus, Cortex (multi-tenant metrics), OpenTelemetry sidecars, centralised log pipelines, and Groundcover / Grafana dashboards.
  • Establish alerting standards and SLO-based alerting rules; ensure distributed traces are actionable across JVM, Rust, and Go workloads.
  • Actively participate in on-call and lead the technical response for platform-level incidents.
  • Set engineering standards and review infrastructure changes across the team.
  • Partner with Security, Product, and Application Engineering to translate reliability requirements into platform capabilities.
  • Grow a team of 3–5 SREs through code review, architecture sessions, and career conversations.

Requirements

What you’ll need
  • 7+ years in SRE, platform engineering, or infrastructure engineering roles.
  • Deep, hands-on Kubernetes experience: operating and scaling clusters across at least two of GKE, EKS, and AKS in production.
  • Proven IaC ownership: Helm chart authoring, Crossplane provider/composition design, and GitOps with Flux or ArgoCD.
  • Strong multi-cloud networking: VPC design, private connectivity (PrivateLink, NCC, VNet Peering), and DNS (Route 53, Cloud DNS, Azure DNS, Cloudflare).
  • Production experience with a service mesh (Linkerd or Istio) and Envoy-based ingress.
  • Solid observability track record with Prometheus, distributed tracing (OpenTelemetry), and structured logging pipelines.
  • Experience securing Kubernetes clusters: RBAC, workload identity / OIDC, mTLS, and secret management with cloud KMS.
  • Comfortable reading and writing at least one systems language (Go, Rust, or similar) and shell scripting for automation and operator development.

Benefits

Comp & perks
  • Competitive salary and equity, benchmarked against senior/lead IC roles in your market.
  • Remote-first culture with flexible working hours.
  • Comprehensive health and wellness benefits.
  • Opportunities for professional development and continuous learning.
  • Collaborative, inclusive, and innovative company culture.
  • A team that has strong opinions, writes good documentation, and builds things they are proud of.

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Kubernetes, IaC, Helm, Crossplane, GitOps, Prometheus, OpenTelemetry, mTLS, Go, Rust
Soft Skills
leadership, capacity planning, technical response, code review, career conversations