Iterable

Senior Site Reliability Engineer, Observability

full-time

Location Type: Hybrid

Location: Lisbon • 🇵🇹 Portugal

Job Level

Senior

Tech Stack

Elasticsearch, Go, Grafana, Kubernetes, Prometheus, Python, SDLC, Terraform

About the role

  • Collaborate deeply with product teams to ensure the frameworks we provide actually solve their problems.
  • Own the long-term roadmap for Datadog, Grafana, Prometheus, Elasticsearch, Quickwit, and emerging OpenTelemetry tooling.
  • Design and automate scalable pipelines (metrics, traces, logs, events) so every engineer has sub-second, queryable visibility into production.
  • Drive upgrades, capacity modeling, and policy enforcement for our dedicated observability-focused clusters; introduce best-in-class patterns for multi-tenant isolation and cost optimization.
  • Contribute production-quality Go or Python services, operators, and Terraform modules that elevate reliability, performance, and developer velocity.
  • Partner with service owners to embed observability into their SDLC, guide best practices, perform instrumentation reviews, and elevate on-call readiness across the org.
  • Reduce MTTR, noise, and waste by designing cost-efficient telemetry architectures, high-signal alerting, and automated recovery patterns.
  • Lead and model operational excellence through on-call participation, post-incident reviews, and continuous improvement initiatives.

Requirements

  • Proven ability to architect and manage production-grade Kubernetes (EKS) clusters, specifically for stateful workloads.
  • Proficiency in infrastructure-as-code (IaC), including Terraform.
  • Deep production experience with Elasticsearch, Prometheus, or OpenTelemetry. You know how to tune these systems for multi-terabyte daily workloads.
  • Proficiency in Go or Python to build custom operators, internal tools, and automation.
  • Ability to optimize ingestion and storage for logs, metrics, and traces while balancing query performance with cost-efficiency.
  • Ability to influence engineering culture by mentoring peers and partnering with service owners to improve their observability posture.
  • A humble, collaborative approach to problem-solving and a bias toward systemic, automated solutions.

Benefits

  • Competitive salaries & meaningful equity
  • Private Medical Insurance
  • Life/Risk Assurance
  • Meal Allowance: 8.55€ per day
  • Community Days (additional paid holidays)
  • Paid Annual Leave (22 days)
  • Paid Sabbatical (after 4 years tenure)
  • Initial laptop workstation setup
  • Teleworking Allowance

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
Go, Python, Terraform, Kubernetes, Elasticsearch, Prometheus, OpenTelemetry, infrastructure-as-code, scalable pipelines, telemetry architectures
Soft skills
collaboration, mentoring, problem-solving, influence, operational excellence, continuous improvement, communication, leadership, cost optimization, systemic solutions