Truelogic Software

Senior Reliability Engineer – Technology

full-time

Location Type: Remote

Location: Remote • Anywhere in Latin America

Job Level

Senior

Tech Stack

AWS, Grafana, Kafka, Kubernetes, Prometheus, Python, Spark

About the role

  • Designs, implements, and continuously improves observability strategies across services, including metrics, logs, traces, alerts, and dashboards.
  • Focuses on understanding system behavior in production, identifying failure modes, performance bottlenecks, and reliability risks.
  • Evolves and maintains shared AWS CDK and CDK8s constructs, with emphasis on observability, autoscaling, and operational safeguards rather than basic infrastructure provisioning.
  • Maintains and operates core platform components such as VPC, EKS clusters, RDS, OpenSearch, and MSK, ensuring they expose meaningful operational signals.
  • Operates and enhances Kubernetes cluster addons such as ingress controllers, cert-manager, autoscalers, and monitoring, logging, and tracing stacks.
  • Defines and maintains SLIs, SLOs, and alerting strategies that clearly distinguish between symptoms, root causes, and actionable operational events.
  • Improves automated operational responses, including autoscaling, self-healing mechanisms, and runbook-driven remediation.
  • Ensures high reliability through structured alerting systems (Prometheus, CloudWatch), noise reduction, alert quality improvements, and recovery mechanisms.
  • Collaborates with engineering teams to investigate production incidents, perform root cause analysis, and drive long-term reliability improvements.
  • Owns CI/CD pipelines for Infrastructure as Code (IaC) and observability-related platform components.
  • Applies Site Reliability Engineering (SRE) principles—including observability-first design, error budgets, and operational readiness—to shared platform services.
  • Supports IAM roles, secrets management, and tenant isolation best practices.

Requirements

  • Has 5+ years of experience in Site Reliability Engineering, Platform Engineering, or Infrastructure roles, with significant hands-on experience operating and supporting production systems.
  • Demonstrates strong experience in observability operations, including defining metrics, logs, traces, dashboards, alerts, and reliability indicators for complex systems.
  • Has hands-on experience with AWS services such as VPC, IAM, RDS, MSK, S3, and CloudWatch, as well as Kubernetes components like Helm, RBAC, and ServiceAccounts.
  • Demonstrates fluency in Python and experience with Infrastructure-as-Code using AWS CDK, CDK8s, or equivalent frameworks.
  • Possesses a strong understanding of Prometheus, Grafana, alert tuning, alert fatigue reduction, and incident-driven monitoring improvements.
  • Has experience improving existing systems rather than building greenfield infrastructure, with a focus on operational excellence and system reliability.
  • Shows a proven track record of using observability data to drive automation, scaling decisions, and operational improvements.
  • Has experience designing reusable infrastructure or observability patterns, or contributing to internal developer or platform tooling.
  • Has experience supporting Spark on Kubernetes, Argo, or Kafka-based batch pipelines (nice to have).

Benefits

  • 100% Remote Work: Enjoy the freedom to work from the location that helps you thrive. All it takes is a laptop and a reliable internet connection.
  • Highly Competitive USD Pay: Earn excellent, market-leading compensation in USD that goes beyond typical market offerings.
  • Paid Time Off: We value your well-being. Our paid time off policies ensure you have the chance to unwind and recharge when needed.
  • Work with Autonomy: Enjoy the freedom to manage your time as long as the work gets done. Focus on results, not the clock.
  • Work with Top American Companies: Grow your expertise working on innovative, high-impact projects with industry-leading U.S. companies.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
Site Reliability Engineering, Platform Engineering, Infrastructure roles, observability operations, AWS CDK, Kubernetes, Python, Infrastructure-as-Code, Prometheus, Grafana
Soft skills
collaboration, root cause analysis, operational excellence, incident management, automation