
Senior Site Reliability Engineer, Observability
Iterable
full-time
Location Type: Hybrid
Location: Lisbon • 🇵🇹 Portugal
Job Level
Senior
Tech Stack
Elasticsearch, Go, Grafana, Kubernetes, Prometheus, Python, SDLC, Terraform
About the role
- Collaborate deeply with product teams to ensure the frameworks we provide actually solve their problems.
- Own the long-term roadmap for Datadog, Grafana, Prometheus, Elasticsearch, Quickwit, and emerging OpenTelemetry tooling.
- Design and automate scalable pipelines (metrics, traces, logs, events) so every engineer has sub-second, queryable visibility into production.
- Drive upgrades, capacity modeling, and policy enforcement for our dedicated observability-focused clusters; introduce best-in-class patterns for multi-tenant isolation and cost optimization.
- Contribute production-quality Go or Python services, operators, and Terraform modules that elevate reliability, performance, and developer velocity.
- Partner with service owners to embed observability into their SDLC, guide best practices, perform instrumentation reviews, and elevate on-call readiness across the org.
- Reduce MTTR, noise, and waste by designing cost-efficient telemetry architectures, high-signal alerting, and automated recovery patterns.
- Lead and model operational excellence through on-call participation, post-incident reviews, and continuous improvement initiatives.
Requirements
- Proven ability to architect and manage production-grade Kubernetes (EKS) clusters, specifically for stateful workloads.
- Proficiency in infrastructure as code (IaC), including Terraform.
- Deep production experience with Elasticsearch, Prometheus, or OpenTelemetry. You know how to tune these systems for multi-terabyte daily workloads.
- Proficiency in Go or Python to build custom operators, internal tools, and automation.
- Ability to optimize ingestion and storage for logs, metrics, and traces while balancing query performance with cost-efficiency.
- Ability to influence engineering culture by mentoring peers and partnering with service owners to improve their observability posture.
- A humble, collaborative approach to problem-solving and a bias toward systemic, automated solutions.
Benefits
- Competitive salaries & meaningful equity
- Private Medical Insurance
- Life/Risk Assurance
- Meal Allowance: 8.55€ per day
- Community Days (additional paid holidays)
- Paid Annual Leave (22 days)
- Paid Sabbatical (after 4 years tenure)
- Initial laptop workstation setup
- Teleworking Allowance
Applicant Tracking System Keywords
Hard skills
Go, Python, Terraform, Kubernetes, Elasticsearch, Prometheus, OpenTelemetry, Infrastructure-as-code, scalable pipelines, telemetry architectures
Soft skills
collaboration, mentoring, problem-solving, influence, operational excellence, continuous improvement, communication, leadership, cost optimization, systemic solutions