Senior Site Reliability Engineer

Finom

Senior Site Reliability Engineer at Finom, driving the design and implementation of a Kubernetes-based platform with a focus on reliability and scalability in a high-load, multi-cloud environment.

Posted 5/27/2026 • Full-time • Remote • 🇧🇬 Bulgaria • Senior

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills

Kubernetes, GKE, GitOps, ArgoCD, CI/CD, GitLab, Terraform, Prometheus, OpenTelemetry, Infrastructure as Code

Soft Skills

cross-functional communication, leadership, problem-solving, collaboration, proactive approach

Tools & Technologies

GCP, AWS, Grafana, incident management frameworks, disaster recovery, blue/green deployment, active/passive strategies

Industry Keywords

high availability, zero-downtime operations, SLOs, SLAs, automated failover, rollback capabilities, operational efficiency, bottleneck detection

Tech Stack

Tools & technologies

AWS, Google Cloud Platform, Grafana, Kubernetes, Prometheus, Terraform

About the role

Key responsibilities & impact

Lead the Platform Evolution: Design and operate our Kubernetes ecosystem (GKE, multi-cluster) with a focus on high availability and zero-downtime operations.
Build "Paved Roads": Own and evolve our PaaS strategy, using GitOps (ArgoCD) and CI/CD (GitLab) to empower domain teams to deploy independently.
Architect Reliability: Define and implement our observability strategy across metrics, logs, and tracing (Prometheus, VictoriaMetrics, OpenTelemetry).
Drive Infrastructure-as-Code: Lead the automation of our infrastructure using Terraform, ensuring all resources are standardized and version-controlled.
Own the Error Budget: Partner with engineering teams to establish and manage SLOs, SLAs, and incident management frameworks.
Disaster Recovery Mastery: Design and participate in regular DR drills, implementing blue/green and active/passive strategies across regions to ensure service continuity.
Innovate Operations: Proactively apply AI-driven approaches to improve operational efficiency and automate bottleneck detection.

Requirements

What you’ll need

Strong hands-on experience managing Kubernetes (GKE preferred) in high-load, multi-cluster production environments
Deep experience with GCP (AWS is a strong plus) and Terraform for large-scale infrastructure
Solid experience with ArgoCD, GitLab CI, and the "Infrastructure as Code" philosophy
Deep knowledge of the Prometheus/Grafana stack and implementing tracing/logging at scale
Proven ability to design highly available 24/7 systems with automated failover and rollback capabilities
English level B2+ for effective cross-functional communication

Benefits

Comp & perks

Make a genuine impact on the product
Work in the EU
Become a stock options holder
Receive unwavering support and care
Work & Swim program