CMG (Capital Markets Gateway)

Site Reliability Engineer

contract

Location Type: Remote

Location: Brazil

About the role

  • Design, implement, and maintain monitoring and observability solutions using tools such as Prometheus, the Grafana stack (Loki, Grafana, Tempo, Alertmanager), Datadog, and OpenTelemetry.
  • Define and implement SLOs, SLIs, and error budgets to measure system reliability.
  • Develop and optimize dashboards, alerts, and reports for system performance and business metrics.
  • Design actionable alerting strategies to minimize noise and improve MTTR.
  • Integrate alerting systems with Jira.
  • Establish and refine runbooks for on-call teams to handle alerts efficiently.
  • Empower teams to own observability coverage and incident response practices for their services.
  • Analyze system performance metrics, identify bottlenecks, and implement optimizations to improve system efficiency, scalability, and cost-effectiveness.
  • Help conduct load testing and capacity planning to ensure systems can handle peak traffic loads.
  • Identify opportunities for automation and develop tools to streamline operational processes, such as fail-over, configuration management, and monitoring.
  • Build monitoring and alerting into automations to detect and resolve issues proactively.
  • Collaborate closely with cross-functional teams, including software engineers, operations, and infrastructure teams, to understand system requirements, provide technical guidance, and drive solutions.
  • Communicate effectively with stakeholders about system changes, incidents, and improvements.
  • Promote SRE principles and practices across the company.

Requirements

  • Must be based in Latin America
  • English level: C1 or C2.
  • Proven experience as a Site Reliability Engineer or in a similar role.
  • Proficiency in logging, metrics, and tracing frameworks (Datadog, Loki, Prometheus, OpenTelemetry).
  • Experience with cloud platforms (Azure preferred) and infrastructure-as-code tools (e.g., Terraform).
  • Strong programming and scripting skills (Python, Bash).
  • Proficiency in containerization technologies and orchestration tools (Docker, Kubernetes).
  • Understanding of Linux-based systems, networking, and security principles related to containerized applications.
  • Strong problem-solving and troubleshooting skills, with a passion for identifying and resolving complex technical issues.
  • Excellent communication and collaboration abilities.
  • Ability to thrive in a fast-paced, constantly evolving environment.
  • Experience with PostgreSQL monitoring and optimization (Optional/Nice to have).

Benefits

  • Contract of 2+ years.
  • 15 business days of vacation.
  • Tech courses and conferences.
  • Top-of-the-line MacBook.
  • Flexible working hours.