Site Reliability Engineer
CMG (Capital Markets Gateway)
Site Reliability Engineer focusing on monitoring, observability, and alerting at CMG, a fintech transforming equity capital markets.
Tech Stack
Tools & technologies
Azure, Cloud, Docker, Grafana, Kubernetes, Linux, Postgres, Prometheus, Python, Terraform
About the role
Key responsibilities & impact
- Design, implement, and maintain monitoring and observability solutions using tools like Prometheus, the Grafana stack (Loki/Grafana/Tempo/Alertmanager), Datadog, and OpenTelemetry.
- Define and implement SLOs, SLIs, and error budgets to measure system reliability.
- Develop and optimize dashboards, alerts, and reports for system performance and business metrics.
- Design actionable alerting strategies to minimize noise and improve MTTR.
- Integrate alerting systems with Jira.
- Establish and refine runbooks for on-call teams to handle alerts efficiently.
- Empower teams to own their observability coverage and incident response practices.
- Analyze system performance metrics, identify bottlenecks, and implement optimizations to improve system efficiency, scalability, and cost-effectiveness.
- Help conduct load testing and capacity planning to ensure systems can handle peak traffic loads.
- Identify opportunities for automation and develop tools to streamline operational processes, such as fail-over, configuration management, and monitoring.
- Implement monitoring and alerting systems within automations to detect and resolve issues proactively.
- Collaborate closely with cross-functional teams, including software engineers, operations, and infrastructure teams, to understand system requirements, provide technical guidance, and drive solutions.
- Communicate effectively to stakeholders about system changes, incidents, and improvements.
- Promote SRE principles and practices across the company.
Requirements
What you’ll need
- Must be based in Latin America
- English level - C1 or C2
- Proven experience as a Site Reliability Engineer or similar role.
- Proficiency in logging, metrics, and tracing frameworks (Datadog, Loki, Prometheus, OpenTelemetry).
- Experience with cloud platforms (Azure preferred) and infrastructure-as-code tools (e.g., Terraform).
- Strong programming and scripting skills (Python, Bash).
- Proficiency in containerization technologies and orchestration tools (Docker, Kubernetes).
- Understanding of Linux-based systems, networking, and security principles related to containerized applications.
- Strong problem-solving and troubleshooting skills, with a passion for identifying and resolving complex technical issues.
- Excellent communication and collaboration abilities.
- Ability to thrive in a fast-paced, constantly evolving environment.
- Experience with PostgreSQL monitoring and optimization (Optional/Nice to have).
Benefits
Comp & perks
- Equity
- Unlimited PTO (15 days + bank holidays + unlimited additional paid leave)
- Comprehensive benefits program managed by Globalization Partners
- Premium life and income protection
- Top private medical and dental insurance
- Employee Assistance Program (EAP)
- Pension contributions
- Remote work environment
- Education reimbursement
- Continuous learning opportunities
- Employee referral bonus
- Parental leave
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
monitoring solutions, observability solutions, SLOs, SLIs, error budgets, load testing, capacity planning, Python, Bash, Terraform
Soft Skills
problem-solving, troubleshooting, communication, collaboration, adaptability