Senior Site Reliability Engineer

Darede

Senior SRE responsible for transitioning operations to a reliability culture in a cloud environment. You will design and implement solutions that prevent system failures in business-critical applications.

Posted 4/27/2026 • Full-time • Remote • 🇧🇷 Brazil • Senior

Tech Stack

Tools & technologies

AWS, Docker, EC2, Go, Kubernetes, .NET, Oracle, Postgres, Python, Terraform

About the role

Key responsibilities & impact

**Incident Leadership:**
Act as Incident Response Lead in War Rooms, coordinating technical remediation and communication with stakeholders.
**Observability Engineering:**
Design and evolve telemetry in Datadog (Logs, APM, Traces and business metrics) to reduce MTTD and the team's cognitive load.
**Workload Management on AWS Amplify:**
Ensure the resilience and scalability of hosted front-end applications and critical APIs.
**SRE Governance:**
Define and monitor SLIs, SLOs and SLAs, managing the Error Budget to balance delivery speed with stability.
**Mitigation Automation:**
Develop auto-healing tools and scripts (automatic rollback, controlled restart, component isolation).
**Root Cause Analysis:**
Lead blameless post-mortem processes and ensure the implementation of structural improvements to prevent recurrence.
**Systems Modernization:**
Work with development teams to implement resilience patterns (Circuit Breakers, Bulkheads and Rate Limiting) in both modern architectures and legacy systems.
**AI in Operations:**
Implement anomaly detection and intelligent response solutions using AIOps (Datadog Bits AI or AWS DevOps Agent).

Requirements

What you’ll need

**Proven Seniority in SRE or DevOps:** Solid experience in high-scale, mission-critical environments.
**Deep AWS Expertise:** Advanced experience with EC2, RDS, S3, IAM, EKS and Amplify.
**Observability Tools:** Strong experience in monitoring, logging and APM (preferably using Datadog).
**Containers & Orchestration:** Strong knowledge of Docker and Kubernetes (EKS/GKE).
**Infrastructure as Code (IaC):** Proficiency in Terraform.
**Development/Scripting:** Proficient in Python, Go or Shell scripting for automation.
**Incident Management:** Hands-on experience with on-call rotations and real-time problem resolution.
**Plus / Nice-to-haves:**
**Analytical Profile for Legacy Systems:** Experience troubleshooting .NET Framework applications and Oracle or PostgreSQL databases.
**Chaos Engineering:** Experience executing controlled stress and resilience tests.
**Certifications:** AWS Certified DevOps Engineer - Professional or official Datadog certifications.

Benefits

Comp & perks

📚 Educational Incentives (Partnerships with Educational Institutions)
🌴 Paid Vacation
🏋️ TotalPass
🎂 Birthday off
🏥 Health Insurance
🦷 Dental Insurance
🤰 Maternity Leave
👨‍👩‍👧‍👦 Paternity Leave
🌟 Reimbursement for AWS Certifications

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Incident Response, Telemetry Design, AWS Amplify, SLIs, SLOs, SLAs, Auto-healing Tools, Python, Go, Terraform

Soft Skills

Leadership, Communication, Analytical Thinking, Problem Resolution

Certifications

AWS Certified DevOps Engineer - Professional, Datadog Certifications