System Reliability Engineer – DevOps

Growe Talents

full-time

Posted on: 3/17/2026

Location Type: Remote

About the role

Ensure availability, performance, and scalability of infrastructure and services through monitoring, automation, and operational best practices;
Lead incident response, perform root cause analysis, and implement recovery and long-term fixes;
Manage infrastructure using Terraform, Terragrunt, and automation tools for consistency and repeatability;
Implement and maintain metrics, logs, and tracing solutions (Prometheus, Grafana, Loki, VictoriaMetrics, CloudWatch) to ensure system visibility;
Identify bottlenecks, tune systems, and improve infrastructure performance;
Monitor resources, forecast growth, and implement scaling strategies;
Integrate security best practices into IaC, CI/CD pipelines, and deployments;
Support vulnerability management;
Participate in 24/7 on-call rotations (once a week) to ensure timely resolution of critical incidents;
Work with DevOps, PRE, development, and security teams to improve reliability and design resilient systems;
Maintain operational runbooks, incident reports, and system documentation.

Requirements

3+ years in a DevOps, SRE, or related role;
Strong hands-on experience with AWS services including EC2, ECS, EKS, RDS, DocumentDB, ElastiCache, Keyspaces, S3, EBS, VPC, Route53, KMS, ACM, and CloudWatch;
Proficiency with Terraform, Terragrunt, and Atlantis for reproducible and version-controlled infrastructure;
Experience with GitLab CI, FluxCD, Argo Rollouts, and automation tools (Ansible, Python, Bash);
Solid experience with Docker, Kubernetes (AWS EKS), and Helm (including custom templates, ChartMuseum);
Familiarity with cluster add-ons such as KEDA, VPA, Karpenter, External-DNS, ingress-nginx, aws-alb-controller, and ebs-csi-driver;
Experience with Grafana, VictoriaMetrics stack, Tempo, metrics exporters, Pingdom, AWS CloudWatch, and alerting systems like PagerDuty, VMAlert, and Alertmanager;
Proficiency with OpenSearch and Vector Agent for centralized logging;
Strong understanding of networking concepts, AWS networking (VPC, Network Firewall, Transit Gateway, Site-to-Site VPN), identity and access management, certificate management (ACM, Vault, SOPS), and application security best practices;
Familiarity with Cloudflare services, including caching, DNS, and Workers;
Exposure to AWS Cost Explorer, KubeCost, and custom cost export tools;
Certifications in AWS, Terraform, Kubernetes, or Helm are a plus.

Benefits

Health & Wellness Focus;
Global Medical Coverage;
Growth Opportunities;
Benefits Programs (compensation for gym, dental, and psychological services, etc.);
Performance-Driven Rewards;
Dynamic Work Environment.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Terraform, Terragrunt, AWS, Docker, Kubernetes, GitLab CI, Ansible, Python, Bash, OpenSearch

Soft Skills

incident response, root cause analysis, problem-solving, collaboration, communication

Certifications

AWS, Terraform, Kubernetes, Helm