Lead Site Reliability Engineer
Akka (formerly Lightbend)
Lead SRE responsible for the reliability, scalability, and security of Akka's cloud-native, multi-tenant PaaS platform, driving operational excellence across AWS, GCP, and Azure environments.
Tech Stack
Tools & technologies
AWS, Azure, Cloud DNS, Flux, Go, Google Cloud Platform, Grafana, Kubernetes, Prometheus, Rust, Shell Scripting
About the role
Key responsibilities & impact
- Own Service Level Objectives/Service Level Indicators (SLOs/SLIs) and error budgets across multi-cloud clusters (EKS, GKE, AKS); drive blameless post-mortems and systemic remediation.
- Lead capacity planning with our customers, cluster lifecycle management, and Kubernetes and database upgrade cycles.
- Define and enforce runbooks, on-call rotations, and escalation paths for the wider engineering organisation.
- Own and evolve the IaC layer: Helm charts, Crossplane compositions, and FluxCD GitOps pipelines.
- Design and maintain cloud-resource provisioning workflows that span all three cloud providers, with consistent policy controls.
- Architect and operate connectivity patterns: AWS PrivateLink / Transit Gateway, GCP NCC, Azure VNet Peering, and cross-region ingress with Contour/Envoy.
- Maintain and evolve the Linkerd service mesh for mTLS, workload identity (OIDC), and zero-trust authorisation policies.
- Drive PKI hygiene with cert-manager: root/intermediate CA rotation, ACME certificate lifecycle, and secret management via KMS-backed Kubernetes vaults.
- Own the observability stack: Prometheus, Cortex (multi-tenant metrics), OpenTelemetry sidecars, centralised log pipelines, and Groundcover / Grafana dashboards.
- Establish alerting standards and SLO-based alerting rules; ensure distributed traces are actionable across JVM, Rust, and Go workloads.
- Actively participate in on-call and lead the technical response for platform-level incidents.
- Set engineering standards and review infrastructure changes across the team.
- Partner with Security, Product, and Application Engineering to translate reliability requirements into platform capabilities.
- Grow a team of 3–5 SREs through code review, architecture sessions, and career conversations.
Requirements
What you’ll need
- 7+ years in SRE, platform engineering, or infrastructure engineering roles.
- Deep, hands-on Kubernetes experience: operating and scaling clusters across at least two of GKE, EKS, and AKS in production.
- Proven IaC ownership: Helm chart authoring, Crossplane provider/composition design, and GitOps with Flux or ArgoCD.
- Strong multi-cloud networking: VPC design, private connectivity (PrivateLink, NCC, VNet Peering), and DNS (Route 53, Cloud DNS, Azure DNS, Cloudflare).
- Production experience with a service mesh (Linkerd or Istio) and Envoy-based ingress.
- Solid observability track record with Prometheus, distributed tracing (OpenTelemetry), and structured logging pipelines.
- Experience securing Kubernetes clusters: RBAC, workload identity / OIDC, mTLS, and secret management with cloud KMS.
- Comfortable reading and writing at least one systems language (Go, Rust, or similar) and shell scripting for automation and operator development.
Benefits
Comp & perks
- Competitive salary and equity, benchmarked against senior/lead IC roles in your market.
- Remote-first culture with flexible working hours.
- Comprehensive health and wellness benefits.
- Opportunities for professional development and continuous learning.
- Collaborative, inclusive, and innovative company culture.
- A team that has strong opinions, writes good documentation, and builds things they are proud of.
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Kubernetes, IaC, Helm, Crossplane, GitOps, Prometheus, OpenTelemetry, mTLS, Go, Rust
Soft Skills
leadership, capacity planning, technical response, code review, career conversations