Site Reliability Engineer, AWS

Truelogic Software

Full-time

Posted on: 11/4/2025

Location Type: Remote

Location: Remote • 🇲🇽 Mexico


Mid-Level • Senior

AWS • Grafana • Kafka • Kubernetes • Node.js • Prometheus • Python • Spark

About the role

Designs, implements, and evolves shared AWS CDK and CDK8s constructs used across multiple services and teams.
Maintains core infrastructure components including VPC, EKS clusters and node groups, RDS, OpenSearch, and MSK.
Operates and extends Kubernetes cluster addons such as ingress controllers, cert-manager, autoscalers, and monitoring/logging stacks.
Ensures high reliability through structured alerting systems (Prometheus, CloudWatch), autoscaling strategies, and recovery mechanisms.
Manages and publishes baseline templates, configuration schemas, and comprehensive documentation for infrastructure usage.
Owns the CI/CD pipelines for Infrastructure as Code (IaC) codebases and platform component releases.
Collaborates with engineering teams to troubleshoot infrastructure-related issues and deliver scalable, reliable solutions.
Applies Site Reliability Engineering (SRE) principles—including SLIs, SLOs, observability, and fault tolerance—to all shared platform services.
Supports IAM roles, secrets management, and tenant isolation best practices.

Requirements

Has 5+ years of experience in infrastructure or Site Reliability Engineering (SRE), including hands-on work with AWS services such as VPC, IAM, RDS, MSK, and S3, as well as Kubernetes tooling and concepts such as Helm, RBAC, and ServiceAccounts.
Demonstrates fluency in Python and has practical experience with Infrastructure-as-Code using AWS CDK, CDK8s, or equivalent frameworks such as Pulumi.
Possesses a strong understanding of Prometheus, Grafana, and effective alert routing practices.
Has experience designing reusable infrastructure patterns or building internal developer platforms.
Shows a proven track record of improving system reliability through automation, monitoring, and operational best practices.
Has experience supporting Spark on Kubernetes, Argo, or Kafka-based batch pipelines.

Benefits

100% Remote Work: Enjoy the freedom to work from the location that helps you thrive. All it takes is a laptop and a reliable internet connection.
Highly Competitive USD Pay: Earn excellent compensation in USD that goes beyond typical market offerings.
Paid Time Off: We value your well-being. Our paid time off policies ensure you have the chance to unwind and recharge when needed.
Work with Autonomy: Enjoy the freedom to manage your time as long as the work gets done. Focus on results, not the clock.
Work with Top American Companies: Grow your expertise working on innovative, high-impact projects with industry-leading U.S. companies.

Tip: use these terms in your resume and cover letter to boost ATS matches.

AWS CDK • CDK8s • Kubernetes • Python • Infrastructure-as-Code • Prometheus • Grafana • IAM • RDS • MSK

collaboration • troubleshooting • reliability improvement • documentation