Tecsys Inc.

Site Reliability Engineer

full-time

Location Type: Remote

Location: Remote • 🇨🇦 Canada

Job Level

Mid-Level, Senior

Tech Stack

Ansible, AWS, Cloud, EC2, Java, Jenkins, Kubernetes, Python, Terraform

About the role

  • Collaborate with other Engineering teams to support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
  • Innovate relentlessly: Identify pain points, propose creative solutions, and drive initiatives that simplify, scale, and strengthen the platform.
  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
  • Own observability: Enhance and expand monitoring and alerting using Datadog; define SLOs/SLIs and create actionable dashboards that drive reliability outcomes.
  • Drive automation: Develop and improve internal tooling, IaC frameworks, and pipelines (Terraform, GitLab CI/CD) to reduce manual intervention and enable self-healing systems.
  • Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity.
  • Participate in the on-call rotation.
  • Practice sustainable incident response and blameless postmortems. Lead post-incident reviews (RCAs) and identify long-term fixes that improve stability, reliability, and developer experience.
  • Implement monitoring, logging, alerting, and SLA reporting.
  • Create and maintain technical documentation.
  • Implement, maintain and mature SRE best practices.
  • Lead incidents: Act as Incident Commander; coordinate cross-team response, manage communications, and ensure rapid service restoration.
  • Provide support for our planning and deployment teams to enable stability, predictability, and scale in our continued growth.
  • Collaborate with members of the Platform Engineering team to implement and support far-reaching strategic efforts, provide constructive feedback, and foster a collaborative environment.
  • Work cross-functionally with internal teams and vendors to manage our growth around the globe, with a strong focus on maintaining the high level of performance, availability, and reliability for our users.

Requirements

  • 5+ years in Site Reliability, Cloud, or DevOps Engineering, ideally in SaaS or large-scale production environments.
  • Experience designing and deploying large-scale systems, multi-vendor platforms, and globally distributed infrastructure.
  • Proven experience managing cloud infrastructure in AWS (multi-account, VPC, EC2, EKS) and Kubernetes at scale.
  • Strong hands-on experience with IaC and automation (Terraform, Ansible, or similar).
  • Familiarity with CI/CD pipelines and release automation (GitLab preferred, Jenkins acceptable).
  • Deep understanding of monitoring and observability using Datadog (or equivalent), including metric design, log pipelines, alerting, and dashboards.
  • Experience with incident management, on-call participation, escalation, and structured postmortems.
  • Scripting skills in Python, Bash, Java or equivalent for automation and diagnostics.
  • Curiosity, ownership, and a bias for action; you see a problem, you solve it, and you share the lessons learned.
  • Experience with FedRAMP (Federal Risk and Authorization Management Program) compliance is a strong asset.
  • Basic knowledge of Java- or .NET-based development required.
  • Strong English communication skills, both written and spoken, are essential for effective correspondence with customers, business partners and colleagues beyond the province of Quebec.
  • Additional requirements:
    • Escalation on-call rotation
    • Occasional travel (quarterly offsites, conferences – less than 10%)
