
Infrastructure Reliability Engineer
Tecsys Inc.
Full-time
Location Type: Remote
Location: Remote • 🇨🇦 Canada
Job Level
Mid-Level, Senior
Tech Stack
Terraform
About the role
- Collaborate with other engineering teams to support services before they go live through activities such as system design consulting, platform and software framework development, capacity planning, and launch reviews.
- Continuously innovate by identifying weak points, proposing creative solutions, and leading initiatives that simplify, scale, and strengthen the platform.
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
- Ensure optimized observability: improve and expand monitoring and alerting using Datadog; define SLOs/SLIs and create actionable dashboards that drive reliability outcomes.
- Develop and promote automation: enhance internal tools, IaC frameworks, and pipelines (Terraform, GitLab CI/CD) to reduce manual intervention and enable self-healing systems.
- Scale systems sustainably through automation and by fostering changes that improve reliability and velocity.
- Practice blameless incident management and post-incident analysis. Lead post-incident reviews (RCA) and identify long-term fixes that improve stability, reliability, and the developer experience.
- Implement monitoring, logging, alerting, and SLA reporting.
- Create and maintain technical documentation.
- Implement, maintain, and evolve SRE best practices.
- Act as incident commander during incidents: coordinate cross-team response, manage communications, and ensure rapid service restoration.
Requirements
- Participate in on-call rotation for incident escalation
- Occasional travel (quarterly on-site visits and conferences; less than 10%)
Applicant Tracking System Keywords
Hard skills
system design consulting, platform development, software framework development, capacity planning, monitoring, alerting, automation, IaC frameworks, Terraform, GitLab CI/CD
Soft skills
collaboration, innovation, leadership, incident management, communication, problem-solving, blameless post-incident analysis, cross-team coordination