
Site Reliability Engineer
CMG (Capital Markets Gateway)
contract
Location Type: Remote
Location: Brazil
About the role
- Design, implement, and maintain monitoring and observability solutions using tools such as Prometheus, the Grafana stack (Loki, Grafana, Tempo, Alertmanager), Datadog, and OpenTelemetry.
- Define and implement SLOs, SLIs, and error budgets to measure system reliability.
- Develop and optimize dashboards, alerts, and reports for system performance and business metrics.
- Design actionable alerting strategies to minimize noise and improve MTTR.
- Integrate alerting systems with Jira.
- Establish and refine runbooks for on-call teams to handle alerts efficiently.
- Empower teams to maintain observability coverage and follow sound incident response practices.
- Analyze system performance metrics, identify bottlenecks, and implement optimizations to improve system efficiency, scalability, and cost-effectiveness.
- Help conduct load testing and capacity planning to ensure systems can handle peak traffic loads.
- Identify opportunities for automation and develop tools to streamline operational processes, such as fail-over, configuration management, and monitoring.
- Implement monitoring and alerting systems within automations to detect and resolve issues proactively.
- Collaborate closely with cross-functional teams, including software engineers, operations, and infrastructure teams, to understand system requirements, provide technical guidance, and drive solutions.
- Communicate effectively to stakeholders about system changes, incidents, and improvements.
- Promote SRE principles and practices across the company.
Requirements
- Must be based in Latin America
- English level: C1 or C2.
- Proven experience as a Site Reliability Engineer or similar role.
- Proficiency in logging, metrics, and tracing frameworks (Datadog, Loki, Prometheus, OpenTelemetry).
- Experience with cloud platforms (Azure preferred) and infrastructure-as-code tools (e.g., Terraform).
- Strong programming and scripting skills (Python, Bash).
- Proficiency in containerization technologies and orchestration tools (Docker, Kubernetes).
- Understanding of Linux-based systems, networking, and security principles related to containerized applications.
- Strong problem-solving and troubleshooting skills, with a passion for identifying and resolving complex technical issues.
- Excellent communication and collaboration abilities.
- Ability to thrive in a fast-paced, constantly evolving environment.
- Experience with PostgreSQL monitoring and optimization (Optional/Nice to have).
Benefits
- Contract of 2+ years.
- 15 business days of vacation.
- Tech courses and conferences.
- Top-of-the-line MacBook.
- Flexible working hours.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
monitoring solutions, observability solutions, SLOs, SLIs, error budgets, load testing, capacity planning, automation, Python, Bash
Soft Skills
problem-solving, troubleshooting, communication, collaboration, adaptability