Site Reliability Engineer

Tecsys Inc.

Site Reliability Engineer maintaining cloud infrastructure reliability for Tecsys solutions. Collaborating across teams to support services and implement automation, observability, and frameworks.

Posted 5/4/2026 • Full-time • Montreal, 🇨🇦 Canada • Mid-Level, Senior

Tech Stack

Tools & technologies

Ansible, AWS, Cloud, EC2, Java, Jenkins, Kubernetes, Python, Terraform

About the role

Key responsibilities & impact

Collaborate with other Engineering teams to support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
Innovate relentlessly: Identify pain points, propose creative solutions, and drive initiatives that simplify, scale, and strengthen the platform.
Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
Own observability: Enhance and expand monitoring and alerting using Datadog; define SLOs/SLIs and create actionable dashboards that drive reliability outcomes.
Drive automation: Develop and improve internal tooling, IaC frameworks, and pipelines (Terraform, GitLab CI/CD) to reduce manual intervention and enable self-healing systems.
Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity.
Act as an agent orchestrator using Amazon Kiro: run multiple activities in parallel by leveraging AI agents to accelerate execution, while personally validating results and completing selected tasks manually when needed.
Be on-call.
Practice sustainable incident response and blameless postmortems. Lead post-incident reviews (RCAs) and identify long-term fixes that improve stability, reliability, and developer experience.
Implement monitoring, logging, alerting, and SLA reporting.
Create and maintain technical documentation.
Implement, maintain and mature SRE best practices.
Lead incidents: Act as Incident Commander; coordinate cross-team response, manage communications, and ensure rapid service restoration.
Provide support for our planning and deployment teams to enable stability, predictability, and scale in our continued growth.
Collaborate with members of the Platform Engineering team to implement and support far-reaching strategic efforts, provide constructive feedback, and foster a collaborative environment.
Work cross-functionally with internal teams and vendors to manage our growth around the globe, with a strong focus on maintaining the high level of performance, availability, and reliability for our users.

Requirements

What you’ll need

5+ years in Site Reliability, Cloud, or DevOps Engineering, ideally in SaaS or large-scale production environments.
Experience designing and deploying large-scale systems, multi-vendor platforms, and globally distributed infrastructure.
Proven experience managing cloud infrastructure in AWS (multi-account, VPC, EC2, EKS) and Kubernetes at scale.
Strong hands-on experience with IaC and automation (Terraform, Ansible, or similar).
Familiarity with CI/CD pipelines and release automation (GitLab preferred, Jenkins acceptable).
Deep understanding of monitoring and observability using Datadog (or equivalent), including metric design, log pipelines, alerting, and dashboards.
Experience with incident management, on-call participation, escalation, and structured postmortems.
Scripting skills in Python, Bash, Java or equivalent for automation and diagnostics.
Curiosity, ownership, and a bias for action; you see a problem, you solve it, and you share the lessons learned.
Experience with FedRAMP (Federal Risk and Authorization Management Program) compliance is a strong asset.
Basic knowledge of Java- or .NET-based development required.
Strong English communication skills, both written and spoken, are essential for effective correspondence with customers, business partners and colleagues beyond the province of Quebec.
Participation in an escalation on-call rotation.
Occasional travel (quarterly offsites, conferences – less than 10%)

Company

About Tecsys Inc.

Tecsys Inc. is a leading provider of supply chain management software and services designed to streamline operations in various industries. Known for its expertise in Warehouse Management Systems (WMS), Tecsys serves sectors including healthcare, distribution, 3PL, retail, and e-commerce. The company's Elite and Omni platforms offer comprehensive solutions for inventory management, transportation management, and order fulfillment. With a focus on healthcare supply chain integration, Tecsys helps organizations achieve high efficiency, cost savings, and improved patient care through innovative technology solutions.

Founded 1983 • 501 - 1000 employees • SaaS, Healthcare, Logistics
Location: Montreal (hybrid)

ATS Keywords

✓ Tailor your resume

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Site Reliability Engineering, Cloud Engineering, DevOps Engineering, Infrastructure as Code, Automation, Monitoring, Observability, Scripting, Incident Management, Compliance

Soft Skills

Curiosity, Ownership, Bias for Action, Problem Solving, Collaboration, Communication