Site Reliability Engineer (SRE)

Denvr

Full-time

Posted on: 12/25/2025

Location Type: Remote

Location: Canada

Job Level

Mid-Level, Senior

Tech Stack

Ansible, AWS Cloud, DNS, Go, Grafana, Kubernetes, Linux, Prometheus, Python, Shell Scripting, TCP/IP, Terraform

About the role

Design, implement, and maintain observability systems with Grafana, Prometheus, VictoriaMetrics, and PromQL to monitor system health and performance.
Explore opportunities to improve the overall observability of the HPC environment using industry best practices.
Participate in on-call rotations, rapidly diagnose and resolve incidents, and perform postmortem reviews to drive continuous improvements.
Hands-on experience automating DevOps pipelines using GitHub Actions (or similar tools).

Requirements

3-5 years of experience in a Site Reliability Engineering (SRE) or DevOps role.
Strong software development background and computer science fundamentals.
Familiarity with tools such as Terraform, Helm, Ansible, and Python for automated infrastructure provisioning.
Knowledge of security practices and compliance standards for enterprise environments.
Familiarity with high-performance computing, specifically in administering GPU-related workloads.
Strong experience in managing Kubernetes clusters in production environments.
Expertise in observability platforms (Grafana, Prometheus, PromQL) for tracking and analyzing system metrics.
Solid understanding of networking fundamentals (TCP/IP, DNS, load balancing, VPNs).
Hands-on experience developing and deploying production-grade applications in AWS Cloud within a hybrid cloud architecture.
Proficiency in Linux administration, shell scripting, and performance tuning.
Strong software development skills (e.g., Bash, Python, Golang) to automate infrastructure and operational tasks.

Benefits

Competitive salary
Flexible working hours
Professional development opportunities

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Grafana, Prometheus, VictoriaMetrics, PromQL, GitHub Actions, Terraform, Helm, Ansible, Python, Kubernetes

Soft Skills

problem-solving, incident resolution, continuous improvement, collaboration