
Site Reliability Engineer, Senior – Observability
ASAAS
full-time
Location Type: Remote
Location: Brazil
About the role
- Design, implement and evolve the company’s observability platform, covering the three pillars: metrics, logs and traces
- Implement and maintain observability stacks
- Define and implement instrumentation standards for applications and infrastructure
- Create strategic and operational dashboards that provide actionable insights to teams
- Define, monitor and manage Service Level Indicators (SLIs) and Service Level Objectives (SLOs), managing Error Budgets
- Implement intelligent alerting systems that reduce noise and focus on actionable alerts
- Collaborate with development teams to improve application observability, promoting instrumentation best practices
- Lead incident response from the observability perspective, ensuring rapid identification of the root cause
- Conduct detailed post-mortem analyses and propose improvements based on observability data
- Promote and disseminate an observability culture and SRE best practices across the organization
- Plan and execute capacity management strategies based on metrics
- Optimize cost and performance of observability solutions at scale
- Automate processes for collecting, processing and visualizing observability data
- Document architectures, runbooks and procedures related to observability
Requirements
- Strong experience implementing and managing observability platforms at scale
- Deep knowledge of Prometheus, including PromQL, service discovery, federation and remote_write
- Advanced experience with Grafana for building dashboards, alerts and managing data sources
- Knowledge of distributed tracing (Jaeger, Tempo, X-Ray) and correlation between metrics, logs and traces
- Experience with OpenTelemetry for instrumenting applications
- Knowledge of scalable logging solutions (Loki, ELK Stack, CloudWatch Logs)
- Experience with Cloud Computing, especially AWS
- Experience with containers (Docker) and orchestration (Kubernetes, ECS)
- Hands-on experience with Infrastructure as Code (IaC) (AWS CDK, Terraform)
- Knowledge of SRE practices, including SLIs, SLOs, Error Budgets and toil reduction
- Proficiency in scripting languages (Python, Bash) and at least one programming language (Go, Java)
- Understanding of Linux systems and their diagnostic tools
- Experience in incident management and post-mortem processes
Benefits
- Medical and dental insurance with no co-pay
- Life insurance
- Medication assistance
- Allowance for fitness activities
- Partnership with Neon for financial wellness
- Partnership with Zenklub for physical and mental health (4 free monthly sessions with a therapist or nutritionist)
- Free meals at headquarters
- Flexible meal allowance via credit card
- Childcare assistance
- Parental support program
- Extended maternity and paternity leave
- Work-from-home allowance
- Work equipment
- Furniture allowance
- Partnerships with coworking spaces during remote work
- Day off during your birthday month
- Happy Hour allowance
- Referral bonus for new hires
- Bonus based on annual targets
- Stock Options plan
- Relaxed work environment, no dress code
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
observability platforms, Prometheus, PromQL, Grafana, distributed tracing, OpenTelemetry, scalable logging solutions, Cloud Computing, Infrastructure as Code, scripting languages
Soft Skills
collaboration, incident response, post-mortem analysis, strategic planning, operational insights, communication, leadership, problem-solving, capacity management, process automation