Alternative Payments

Senior Site Reliability Engineer (SRE)

full-time

Location Type: Remote

Location: Brazil

Salary

💰 $72,000 - $90,000 per year

About the role

  • Leading and executing on key reliability initiatives from planning to delivery, particularly focusing on monitoring, alerting, and incident response for the Firefighters team.
  • Monitoring and Alerting Setup: Configuring comprehensive alerting systems, including queue monitoring and service health checks.
  • Metrics and Dashboards: Building performance dashboards, implementing load testing, and creating capacity metrics for presentations.
  • Observability Enhancement: Implementing end-to-end traceability with distributed tracing and service profiling.
  • Infrastructure Automation: Working on pipeline improvements, moving to strength-based pipelines.
  • Datadog Integration: Continuing the migration back to Datadog and optimizing our monitoring stack.
  • Collaborating with cross-functional teams to deliver scalable solutions, optimize processes, and implement highly reliable systems.
  • Taking ownership of complex SRE tasks, including configuring monitoring systems and defining and enforcing SLIs, SLOs, and SLAs.
  • Proposing improvements and helping establish best practices, workflows, and standards for incident response, blameless post-mortems, and continuous improvement.

Requirements

  • 7-10+ years of experience in Site Reliability Engineering, DevOps, or a similar role focused on large-scale distributed systems.
  • Strong skills in Kubernetes for container orchestration and cluster management.
  • Extensive experience with AWS as a core cloud platform for infrastructure management.
  • Strong proficiency with Datadog for monitoring, logging, tracing, and alerting; this is critical for the role.
  • Proven experience in designing, implementing, and optimizing CI/CD pipelines, ideally with GitHub Pipelines.
  • Strong understanding and practical application of SRE principles: SLI/SLO/SLA definition, error budget management, incident response, post-mortem analysis, and toil reduction.
  • A proactive mindset with the ability to solve complex problems, drive projects independently, and continuously innovate our reliability practices.
  • Strong communication skills, especially in English, to collaborate effectively across technical teams and stakeholders.

Benefits

  • Competitive salary tailored to your experience, skills, and expertise.
  • Equity opportunities so you can share in our growth and success.
  • Unlimited PTO and flexibility when you need it the most.
  • Referral bonus. We believe we hire fantastic people, and great talent recognizes great talent, so we offer a significant bonus for each referral we hire.
  • Yearly learning & development stipend to help you grow and do your best work.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Site Reliability Engineering, DevOps, Kubernetes, AWS, Datadog, CI/CD pipelines, GitHub Pipelines, monitoring systems, load testing, distributed tracing
Soft Skills
problem solving, project management, communication, collaboration, proactive mindset, continuous improvement, ownership, innovation, cross-functional teamwork, blameless post-mortems