Site Reliability Engineer, Senior – Observability

ASAAS

Full-time

Posted on: 1/22/2026

Location Type: Remote

Location: Brazil

Job Level

Senior

Tech Stack

AWS, Cloud, Docker, Go, Grafana, Java, Kubernetes, Linux, Prometheus, Python, Ray, Terraform

About the role

Design, implement and evolve the company’s observability platform, covering the three pillars: metrics, logs and traces
Implement and maintain observability stacks
Define and implement instrumentation standards for applications and infrastructure
Create strategic and operational dashboards that provide actionable insights to teams
Define, monitor and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs) and the associated Error Budgets
Implement intelligent alerting systems that reduce noise and focus on actionable alerts
Collaborate with development teams to improve application observability, promoting instrumentation best practices
Lead incident response from the observability perspective, ensuring rapid identification of the root cause
Conduct detailed post-mortem analyses and propose improvements based on observability data
Promote and disseminate an observability culture and SRE best practices across the organization
Plan and execute capacity management strategies based on metrics
Optimize cost and performance of observability solutions at scale
Automate processes for collecting, processing and visualizing observability data
Document architectures, runbooks and procedures related to observability

Requirements

Strong experience implementing and managing observability platforms at scale
Deep knowledge of Prometheus, including PromQL, service discovery, federation and remote_write
Advanced experience with Grafana for building dashboards, alerts and managing data sources
Knowledge of distributed tracing (Jaeger, Tempo, X-Ray) and correlation between metrics, logs and traces
Experience with OpenTelemetry for instrumenting applications
Knowledge of scalable logging solutions (Loki, ELK Stack, CloudWatch Logs)
Experience with Cloud Computing, especially AWS
Experience with containers (Docker) and orchestration (Kubernetes, ECS)
Hands-on experience with Infrastructure as Code (IaC) tools (AWS CDK, Terraform)
Knowledge of SRE practices, including SLIs, SLOs, Error Budgets and toil reduction
Proficiency in scripting languages (Python, Bash) and at least one programming language (Go, Java)
Understanding of Linux systems and their diagnostic tools
Experience in incident management and post-mortem processes

Benefits

Medical and dental insurance with no co-pay
Life insurance
Medication assistance
Allowance for fitness activities
Partnership with Neon for financial wellness
Partnership with Zenklub for physical and mental health (4 free monthly sessions with a therapist or nutritionist)
Free meals at headquarters
Flexible meal allowance via credit card
Childcare assistance
Parental support program
Extended maternity and paternity leave
Work-from-home allowance
Work equipment
Furniture allowance
Partnerships with coworking spaces during remote work
Day off during your birthday month
Happy Hour allowance
Referral bonus for new hires
Bonus based on annual targets
Stock Options plan
Relaxed work environment, no dress code

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

observability platforms, Prometheus, PromQL, Grafana, distributed tracing, OpenTelemetry, scalable logging solutions, Cloud Computing, Infrastructure as Code, scripting languages

Soft Skills

collaboration, incident response, post-mortem analysis, strategic planning, operational insights, communication, leadership, problem-solving, capacity management, process automation