ASAAS

Site Reliability Engineer, Senior – Observability

Employment Type: Full-time

Location Type: Remote

Location: Brazil

About the role

  • Design, implement and evolve the company’s observability platform, covering the three pillars: metrics, logs and traces
  • Implement and maintain observability stacks
  • Define and implement instrumentation standards for applications and infrastructure
  • Create strategic and operational dashboards that provide actionable insights to teams
  • Define, monitor and manage Service Level Indicators (SLIs) and Service Level Objectives (SLOs), managing Error Budgets
  • Implement intelligent alerting systems that reduce noise and focus on actionable alerts
  • Collaborate with development teams to improve application observability, promoting instrumentation best practices
  • Lead incident response from the observability perspective, ensuring rapid identification of the root cause
  • Conduct detailed post-mortem analyses and propose improvements based on observability data
  • Promote and disseminate an observability culture and SRE best practices across the organization
  • Plan and execute capacity management strategies based on metrics
  • Optimize cost and performance of observability solutions at scale
  • Automate processes for collecting, processing and visualizing observability data
  • Document architectures, runbooks and procedures related to observability

Requirements

  • Strong experience implementing and managing observability platforms at scale
  • Deep knowledge of Prometheus, including PromQL, service discovery, federation and remote_write
  • Advanced experience with Grafana for building dashboards, alerts and managing data sources
  • Knowledge of distributed tracing (Jaeger, Tempo, X-Ray) and correlation between metrics, logs and traces
  • Experience with OpenTelemetry for instrumenting applications
  • Knowledge of scalable logging solutions (Loki, ELK Stack, CloudWatch Logs)
  • Experience with Cloud Computing, especially AWS
  • Experience with containers (Docker) and orchestration (Kubernetes, ECS)
  • Hands-on experience with Infrastructure as Code (IaC) (AWS CDK, Terraform)
  • Knowledge of SRE practices, including SLIs, SLOs, Error Budgets and toil reduction
  • Proficiency in scripting languages (Python, Bash) and at least one programming language (Go, Java)
  • Understanding of Linux systems and their diagnostic tools
  • Experience in incident management and post-mortem processes

Benefits

  • Medical and dental insurance with no co-pay
  • Life insurance
  • Medication assistance
  • Allowance for fitness activities
  • Partnership with Neon for financial wellness
  • Partnership with Zenklub for physical and mental health (4 free monthly sessions with a therapist or nutritionist)
  • Free meals at headquarters
  • Flexible meal allowance via credit card
  • Childcare assistance
  • Parental support program
  • Extended maternity and paternity leave
  • Work-from-home allowance
  • Work equipment
  • Furniture allowance
  • Partnerships with coworking spaces for remote work
  • Day off during your birthday month
  • Happy Hour allowance
  • Referral bonus for new hires
  • Bonus based on annual targets
  • Stock Options plan
  • Relaxed work environment, no dress code

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

observability platforms, Prometheus, PromQL, Grafana, distributed tracing, OpenTelemetry, scalable logging solutions, Cloud Computing, Infrastructure as Code, scripting languages

Soft Skills

collaboration, incident response, post-mortem analysis, strategic planning, operational insights, communication, leadership, problem-solving, capacity management, process automation