Tecsys Inc.

Infrastructure Reliability Engineer

full-time

Location Type: Remote

Location: Remote • 🇨🇦 Canada

Job Level

Mid-Level, Senior

Tech Stack

Terraform

About the role

  • Collaborate with other engineering teams to support services before they go live through activities such as system design consulting, platform and software framework development, capacity planning, and launch reviews.
  • Continuously innovate by identifying weak points, proposing creative solutions, and leading initiatives that simplify, scale, and strengthen the platform.
  • Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
  • Ensure optimized observability: improve and expand monitoring and alerting using Datadog; define SLOs/SLIs and create actionable dashboards that drive reliability outcomes.
  • Develop and promote automation: enhance internal tools, IaC frameworks, and pipelines (Terraform, GitLab CI/CD) to reduce manual intervention and enable self-healing systems.
  • Scale systems sustainably through automation and by fostering changes that improve reliability and velocity.
  • Practice blameless incident management and post-incident analysis. Lead post-incident reviews (RCA) and identify long-term fixes that improve stability, reliability, and the developer experience.
  • Implement monitoring, logging, alerting, and SLA reporting.
  • Create and maintain technical documentation.
  • Implement, maintain, and evolve SRE best practices.
  • Act as incident commander during incidents: coordinate cross-team response, manage communications, and ensure rapid service restoration.

Requirements

  • Participate in the on-call rotation for incident escalation.
  • Occasional travel (quarterly on-site visits and conferences; less than 10%).

Applicant Tracking System Keywords

Hard skills
system design consulting, platform development, software framework development, capacity planning, monitoring, alerting, automation, IaC frameworks, Terraform, GitLab CI/CD
Soft skills
collaboration, innovation, leadership, incident management, communication, problem-solving, blameless post-incident analysis, cross-team coordination