CMG (Capital Markets Gateway)

Site Reliability Engineer

Site Reliability Engineer focusing on monitoring, observability, and alerting in capital markets fintech. Designing and implementing solutions to enhance system reliability and performance.

Posted 5/29/2026 · Full-time · Anywhere in Latin America · Mid-Level, Senior

Tech Stack

Tools & technologies
Azure, Cloud, Docker, Grafana, Kubernetes, Linux, Postgres, Prometheus, Python, Terraform

About the role

Key responsibilities & impact
  • Design, implement, and maintain monitoring and observability solutions using tools like Prometheus, the Grafana stack (Loki/Grafana/Tempo/Alertmanager), Datadog, and OpenTelemetry.
  • Define and implement SLOs, SLIs, and error budgets to measure system reliability.
  • Develop and optimize dashboards, alerts, and reports for system performance and business metrics.
  • Design actionable alerting strategies to minimize noise and improve MTTR.
  • Integrate alerting systems with Jira.
  • Establish and refine runbooks for on-call teams to handle alerts efficiently.
  • Empower teams to ensure observability coverage and follow incident response practices.
  • Analyze system performance metrics, identify bottlenecks, and implement optimizations to improve system efficiency, scalability, and cost-effectiveness.
  • Help conduct load testing and capacity planning to ensure systems can handle peak traffic loads.
  • Identify opportunities for automation and develop tools to streamline operational processes, such as fail-over, configuration management, and monitoring.
  • Implement monitoring and alerting systems within automations to detect and resolve issues proactively.
  • Collaborate closely with cross-functional teams, including software engineers, operations, and infrastructure teams, to understand system requirements, provide technical guidance, and drive solutions.
  • Communicate effectively with stakeholders about system changes, incidents, and improvements.
  • Promote and spread SRE principles and practices across the company.

Requirements

What you’ll need
  • Must be based in Latin America
  • English level: C1 or C2
  • Proven experience as a Site Reliability Engineer or similar role.
  • Proficiency in logging, metrics, and tracing frameworks (Datadog, Loki, Prometheus, OpenTelemetry).
  • Experience with cloud platforms (Azure preferred) and infrastructure-as-code tools (e.g., Terraform).
  • Strong programming and scripting skills (Python, Bash).
  • Proficiency in containerization technologies and orchestration tools (Docker, Kubernetes).
  • Understanding of Linux-based systems, networking, and security principles related to containerized applications.
  • Strong problem-solving and troubleshooting skills, with a passion for identifying and resolving complex technical issues.
  • Excellent communication and collaboration abilities.
  • Ability to thrive in a fast-paced, constantly evolving environment.
  • Experience with PostgreSQL monitoring and optimization (Optional/Nice to have).

Benefits

Comp & perks
  • Equity
  • Unlimited PTO (28 days including bank holidays + unlimited additional paid leave)
  • Comprehensive benefits program managed by Globalization Partners
  • Premium life and income protection
  • Top private medical and dental insurance
  • Employee Assistance Program (EAP)
  • Pension contributions
  • Hybrid work environment (initially remote until office setup is complete)
  • Education reimbursement
  • Continuous learning opportunities
  • Employee referral bonus
  • Parental leave

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
monitoring solutions, observability solutions, SLOs, SLIs, error budgets, load testing, capacity planning, automation, Python, Bash
Soft Skills
problem-solving, troubleshooting, communication, collaboration, adaptability