Senior Site Reliability Engineer, Observability

Iterable

full-time

Posted on: 12/19/2025

Location Type: Hybrid

Location: Lisbon • 🇵🇹 Portugal

Senior

Elasticsearch, Go, Grafana, Kubernetes, Prometheus, Python, SDLC, Terraform

About the role

Collaborate deeply with product teams to ensure the frameworks we provide actually solve their problems.
Own the long-term roadmap for Datadog, Grafana, Prometheus, Elasticsearch, Quickwit, and emerging OpenTelemetry tooling.
Design and automate scalable pipelines (metrics, traces, logs, events) so every engineer has sub-second, queryable visibility into production.
Drive upgrades, capacity modeling, and policy enforcement for our dedicated observability-focused clusters; introduce best-in-class patterns for multi-tenant isolation and cost optimization.
Contribute production-quality Go or Python services, operators, and Terraform modules that elevate reliability, performance, and developer velocity.
Partner with service owners to embed observability into their SDLC, guide best practices, perform instrumentation reviews, and elevate on-call readiness across the org.
Reduce MTTR, noise, and waste by designing cost-efficient telemetry architectures, high-signal alerting, and automated recovery patterns.
Lead and model operational excellence through on-call participation, post-incident reviews, and continuous improvement initiatives.

Requirements

Proven ability to architect and manage production-grade Kubernetes (EKS) clusters, specifically for stateful workloads.
Proficiency in infrastructure as code (IaC), including Terraform.
Deep production experience with Elasticsearch, Prometheus, or OpenTelemetry. You know how to tune these systems for multi-terabyte daily workloads.
Proficiency in Go or Python to build custom operators, internal tools, and automation.
Ability to optimize ingestion and storage for logs, metrics, and traces while balancing query performance with cost-efficiency.
Ability to influence engineering culture by mentoring peers and partnering with service owners to improve their observability posture.
A humble, collaborative approach to problem-solving and a bias toward systemic, automated solutions.

Benefits
