Truelogic Software

Site Reliability Engineer, AWS

Full-time

Location Type: Remote

Location: Remote • 🇲🇽 Mexico

Job Level

Mid-Level, Senior

Tech Stack

AWS, Grafana, Kafka, Kubernetes, Node.js, Prometheus, Python, Spark

About the role

  • Designs, implements, and evolves shared AWS CDK and CDK8s constructs used across multiple services and teams.
  • Maintains core infrastructure components including VPC, EKS clusters and node groups, RDS, OpenSearch, and MSK.
  • Operates and extends Kubernetes cluster addons such as ingress controllers, cert-manager, autoscalers, and monitoring/logging stacks.
  • Ensures high reliability through structured alerting systems (Prometheus, CloudWatch), autoscaling strategies, and recovery mechanisms.
  • Manages and publishes baseline templates, configuration schemas, and comprehensive documentation for infrastructure usage.
  • Owns the CI/CD pipelines for Infrastructure as Code (IaC) codebases and platform component releases.
  • Collaborates with engineering teams to troubleshoot infrastructure-related issues and deliver scalable, reliable solutions.
  • Applies Site Reliability Engineering (SRE) principles, including SLIs, SLOs, observability, and fault tolerance, to all shared platform services.
  • Supports IAM roles, secrets management, and tenant isolation best practices.

Requirements

  • Has 5+ years of experience in infrastructure or Site Reliability Engineering (SRE), including hands-on work with AWS services such as VPC, IAM, RDS, MSK, and S3, as well as Kubernetes components like Helm, RBAC, and ServiceAccounts.
  • Demonstrates fluency in Python and has practical experience with Infrastructure-as-Code using AWS CDK, CDK8s, or equivalent frameworks such as Pulumi.
  • Possesses a strong understanding of Prometheus, Grafana, and effective alert routing practices.
  • Has experience designing reusable infrastructure patterns or building internal developer platforms.
  • Shows a proven track record of improving system reliability through automation, monitoring, and operational best practices.
  • Has experience supporting Spark on Kubernetes, Argo, or Kafka-based batch pipelines.

Benefits

  • 100% Remote Work: Enjoy the freedom to work from the location that helps you thrive. All it takes is a laptop and a reliable internet connection.
  • Highly Competitive USD Pay: Earn excellent, market-leading compensation in USD that goes beyond typical market offerings.
  • Paid Time Off: We value your well-being. Our paid time off policies ensure you have the chance to unwind and recharge when needed.
  • Work with Autonomy: Enjoy the freedom to manage your time as long as the work gets done. Focus on results, not the clock.
  • Work with Top American Companies: Grow your expertise working on innovative, high-impact projects with industry-leading U.S. companies.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
AWS CDK, CDK8s, Kubernetes, Python, Infrastructure-as-Code, Prometheus, Grafana, IAM, RDS, MSK
Soft skills
collaboration, troubleshooting, reliability improvement, documentation