CMG (Capital Markets Gateway)

Site Reliability Engineer

Site Reliability Engineer focusing on monitoring, observability, and alerting at CMG, a fintech transforming equity capital markets.

Posted 5/29/2026 • Full-time • Remote • 🇨🇦 Canada • Mid-Level, Senior

Tech Stack

Tools & technologies
Azure, Cloud, Docker, Grafana, Kubernetes, Linux, Postgres, Prometheus, Python, Terraform

About the role

Key responsibilities & impact
  • Design, implement, and maintain monitoring and observability solutions using tools like Prometheus, the Grafana stack (Loki/Grafana/Tempo/Alertmanager), Datadog, and OpenTelemetry.
  • Define and implement SLOs, SLIs, and error budgets to measure system reliability.
  • Develop and optimize dashboards, alerts, and reports for system performance and business metrics.
  • Design actionable alerting strategies to minimize noise and improve MTTR.
  • Integrate alerting systems with Jira.
  • Establish and refine runbooks for on-call teams to handle alerts efficiently.
  • Empower teams to own their observability coverage and incident response practices.
  • Analyze system performance metrics, identify bottlenecks, and implement optimizations to improve system efficiency, scalability, and cost-effectiveness.
  • Help conduct load testing and capacity planning to ensure systems can handle peak traffic loads.
  • Identify opportunities for automation and develop tools to streamline operational processes, such as fail-over, configuration management, and monitoring.
  • Implement monitoring and alerting systems within automations to detect and resolve issues proactively.
  • Collaborate closely with cross-functional teams, including software engineers, operations, and infrastructure teams, to understand system requirements, provide technical guidance, and drive solutions.
  • Communicate effectively to stakeholders about system changes, incidents, and improvements.
  • Promote and spread SRE principles and practices across the company.

Requirements

What you’ll need
  • Must be based in Latin America
  • English level: C1 or C2
  • Proven experience as a Site Reliability Engineer or similar role.
  • Proficiency in logging, metrics, and tracing frameworks (Datadog, Loki, Prometheus, OpenTelemetry).
  • Experience with cloud platforms (Azure preferred) and infrastructure-as-code tools (e.g., Terraform).
  • Strong programming and scripting skills (Python, Bash).
  • Proficiency in containerization technologies and orchestration tools (Docker, Kubernetes).
  • Understanding of Linux-based systems, networking, and security principles related to containerized applications.
  • Strong problem-solving and troubleshooting skills, with a passion for identifying and resolving complex technical issues.
  • Excellent communication and collaboration abilities.
  • Ability to thrive in a fast-paced, constantly evolving environment.
  • Experience with PostgreSQL monitoring and optimization (Optional/Nice to have).

Benefits

Comp & perks
  • Equity
  • Unlimited PTO (15 days + bank holidays + unlimited additional paid leave)
  • Comprehensive benefits program managed by Globalization Partners
  • Premium life and income protection
  • Top private medical and dental insurance
  • Employee Assistance Program (EAP)
  • Pension contributions
  • Remote work environment
  • Education reimbursement
  • Continuous learning opportunities
  • Employee referral bonus
  • Parental leave
