Staff Software Engineer – Grafana Databases, Managed Services

Grafana Labs

Staff Software Engineer managing 100+ streaming clusters in Grafana's cloud infrastructure. Working with distributed systems and enhancing reliability for mission-critical services.

Posted 4/13/2026 • Full-time • Remote • 🇪🇸 Spain • Lead • 💰 €94,025 - €112,830 per year

Tech Stack

Tools & technologies

AWS, Azure, Cassandra, Cloud, Distributed Systems, Go, Google Cloud Platform, Kafka, Kubernetes, Linux, Postgres, Terraform

About the role

Key responsibilities & impact

Operating and evolving 100+ multi-cloud streaming clusters and related database infrastructure
Diagnosing and eliminating cross-layer failure modes (e.g., object storage latency, noisy neighbors, control-plane bottlenecks, and query performance regressions)
Designing safe upgrade and rollout strategies at scale
Improving observability, automation, and operational ergonomics
Partnering closely with database and platform teams to ensure safe scaling, partitioning, consumer fan-out, and query performance
Working directly with distributed systems behavior, Kubernetes scheduling dynamics, storage engines, compression trade-offs, etc.
Serving as a primary escalation point and on-call for relevant incidents
Owning the relationship with all system vendors, including WarpStream Labs

Requirements

What you’ll need

8+ years of engineering experience, including meaningful time in SRE, platform engineering, production engineering, infrastructure engineering, or distributed systems roles.
Experience with high-throughput streaming systems, analytical or storage backends, or large-scale database infrastructure. Examples of these include Kafka, Redpanda, WarpStream, Postgres, ClickHouse, Snowflake, or Cassandra.
Strong Kubernetes experience in AWS, GCP, or Azure, and familiarity with infrastructure-as-code tooling (Helm, Terraform, Jsonnet, etc.).
Experience leading or driving complex technical efforts, even without formal management responsibilities
Ability to influence technical direction and align teams around reliability improvements
Strong understanding of distributed systems failure modes in multi-cloud environments.
Proficiency in at least one systems-oriented language (Go preferred, but not required).
Working knowledge of Linux internals, networking, cloud storage, and performance/scaling behavior.
Experience participating in blameless incident response and writing high-quality post-incident reviews.
Clear communicator who can collaborate across teams and work autonomously.
Intellectually curious, transparent, action-oriented, and kind (this is important!)

Benefits

Comp & perks

Restricted Stock Units (RSUs)
Health insurance
30 days annual leave
Company-funded usage budget for developer tools

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Kubernetes, Go, Linux internals, networking, cloud storage, performance behavior, infrastructure-as-code, streaming systems, database infrastructure, incident response

Soft Skills

clear communicator, collaboration, autonomy, influence technical direction, lead complex technical efforts, intellectual curiosity, transparency, action-oriented, kindness