
Senior Site Reliability Engineer, SRE
Alternative Payments
Full-time
Location Type: Remote
Location: Brazil
Salary
💰 $72,000 - $90,000 per year
About the role
- Leading and executing on key reliability initiatives from planning to delivery, particularly focusing on monitoring, alerting, and incident response for the Firefighters team.
- Monitoring and Alerting Setup: Configuring comprehensive alerting systems, including queue monitoring and service health checks.
- Metrics and Dashboards: Building performance dashboards, implementing load testing, and creating capacity metrics for presentations.
- Observability Enhancement: Implementing end-to-end traceability with distributed tracing and service profiling.
- Infrastructure Automation: Working on pipeline improvements, moving to strength-based pipelines.
- Datadog Integration: Continuing the migration back to Datadog and optimizing our monitoring stack.
- Collaborating with cross-functional teams to deliver scalable solutions, optimize processes, and implement highly reliable systems.
- Taking ownership of complex SRE tasks, including configuring monitoring systems, defining and enforcing SLIs, SLOs, and SLAs.
- Proposing improvements and helping establish best practices, workflows, and standards for incident response, blameless post-mortems, and continuous improvement.
Requirements
- 7-10+ years of experience in Site Reliability Engineering, DevOps, or a similar role focused on large-scale distributed systems.
- Strong skills in Kubernetes for container orchestration and cluster management.
- Extensive experience with AWS as a core cloud platform for infrastructure management.
- Strong proficiency with Datadog for monitoring, logging, tracing, and alerting; this skill is critical for the role.
- Proven experience in designing, implementing, and optimizing CI/CD pipelines, ideally with GitHub Pipelines.
- Strong understanding and practical application of SRE principles: SLI/SLO/SLA definition, error budget management, incident response, post-mortem analysis, and toil reduction.
- A proactive mindset with the ability to solve complex problems, drive projects independently, and continuously innovate our reliability practices.
- Strong communication skills, especially in English, to collaborate effectively across technical teams and stakeholders.
Benefits
- Competitive salary tailored to your experience, skills, and expertise.
- Equity opportunities so you can share in our growth and success.
- Unlimited PTO and flexibility when you need it the most.
- Referral bonus. We believe we hire fantastic people, and great talent recognizes great talent, so we offer a significant bonus for each referral we hire.
- Yearly learning & development stipend to help you grow and do your best work.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Site Reliability Engineering, DevOps, Kubernetes, AWS, Datadog, CI/CD pipelines, GitHub Pipelines, monitoring systems, load testing, distributed tracing
Soft Skills
problem solving, project management, communication, collaboration, proactive mindset, continuous improvement, ownership, innovation, cross-functional teamwork, blameless post-mortems