
Site Reliability Engineer, AWS
Truelogic Software
full-time
Location Type: Remote
Location: Remote • 🇲🇽 Mexico
Job Level
Mid-Level, Senior
Tech Stack
AWS, Grafana, Kafka, Kubernetes, Node.js, Prometheus, Python, Spark
About the role
- Designs, implements, and evolves shared AWS CDK and CDK8s constructs used across multiple services and teams.
- Maintains core infrastructure components including VPC, EKS clusters and node groups, RDS, OpenSearch, and MSK.
- Operates and extends Kubernetes cluster addons such as ingress controllers, cert-manager, autoscalers, and monitoring/logging stacks.
- Ensures high reliability through structured alerting systems (Prometheus, CloudWatch), autoscaling strategies, and recovery mechanisms.
- Manages and publishes baseline templates, configuration schemas, and comprehensive documentation for infrastructure usage.
- Owns the CI/CD pipelines for Infrastructure as Code (IaC) codebases and platform component releases.
- Collaborates with engineering teams to troubleshoot infrastructure-related issues and deliver scalable, reliable solutions.
- Applies Site Reliability Engineering (SRE) principles—including SLIs, SLOs, observability, and fault tolerance—to all shared platform services.
- Supports IAM roles, secrets management, and tenant isolation best practices.
Requirements
- Has 5+ years of experience in infrastructure or Site Reliability Engineering (SRE), including hands-on work with AWS services such as VPC, IAM, RDS, MSK, and S3, as well as Kubernetes components like Helm, RBAC, and ServiceAccounts.
- Demonstrates fluency in Python and has practical experience with Infrastructure-as-Code using AWS CDK, CDK8s, or equivalent frameworks such as Pulumi.
- Possesses a strong understanding of Prometheus, Grafana, and effective alert routing practices.
- Has experience designing reusable infrastructure patterns or building internal developer platforms.
- Shows a proven track record of improving system reliability through automation, monitoring, and operational best practices.
- Has experience supporting Spark on Kubernetes, Argo, or Kafka-based batch pipelines.
Benefits
- 100% Remote Work: Enjoy the freedom to work from the location that helps you thrive. All it takes is a laptop and a reliable internet connection.
- Highly Competitive USD Pay: Earn excellent, market-leading compensation in USD that goes beyond typical market offerings.
- Paid Time Off: We value your well-being. Our paid time off policies ensure you have the chance to unwind and recharge when needed.
- Work with Autonomy: Enjoy the freedom to manage your time as long as the work gets done. Focus on results, not the clock.
- Work with Top American Companies: Grow your expertise working on innovative, high-impact projects with industry-leading U.S. companies.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
AWS CDK, CDK8s, Kubernetes, Python, Infrastructure-as-Code, Prometheus, Grafana, IAM, RDS, MSK
Soft skills
collaboration, troubleshooting, reliability improvement, documentation