Site Reliability Engineer, SRE
Denvr
full-time
Location Type: Remote
Location: Remote • 🇨🇦 Canada
Job Level
Mid-Level, Senior
Tech Stack
Ansible, AWS, Cloud, DNS, Go, Grafana, Kubernetes, Linux, Prometheus, Python, Shell Scripting, TCP/IP, Terraform
About the role
- Design, implement, and maintain observability systems with Grafana, Prometheus, VictoriaMetrics, and PromQL to monitor system health and performance.
- Explore opportunities to improve the overall observability of the HPC environment using industry best practices.
- Participate in on-call rotations, rapidly diagnose and resolve incidents, and perform postmortem reviews to drive continuous improvements.
- Hands-on experience automating DevOps pipelines using GitHub Actions (or similar tools).
Requirements
- 3-5 years of experience in a Site Reliability Engineering (SRE) or DevOps role.
- Strong software development background and computer science fundamentals.
- Familiarity with tools such as Terraform, Helm, Ansible, and Python for automated infrastructure provisioning.
- Knowledge of security practices and compliance standards for enterprise environments.
- Familiarity with high-performance computing, specifically in administering GPU-related workloads.
- Strong experience in managing Kubernetes clusters in production environments.
- Expertise in observability platforms (Grafana, Prometheus, PromQL) for tracking and analyzing system metrics.
- Solid understanding of networking fundamentals (TCP/IP, DNS, load balancing, VPNs).
- Hands-on experience developing and deploying production-grade applications on AWS in a hybrid cloud architecture.
- Proficiency in Linux administration, shell scripting, and performance tuning.
- Strong software development skills (e.g., Bash, Python, Golang) to automate infrastructure and operational tasks.
Benefits
- Competitive salary
- Flexible working hours
- Professional development opportunities
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
Grafana, Prometheus, VictoriaMetrics, PromQL, GitHub Actions, Terraform, Helm, Ansible, Python, Kubernetes
Soft skills
problem-solving, incident resolution, continuous improvement, collaboration