
Senior Reliability Engineer – Technology
Truelogic Software
Full-time
Location Type: Remote
Location: Remote • 🇧🇷 Brazil
Job Level
Senior
Tech Stack
AWS, Grafana, Kafka, Kubernetes, Prometheus, Python, Spark
About the role
- Designs, implements, and continuously improves observability strategies across services, including metrics, logs, traces, alerts, and dashboards.
- Focuses on understanding system behavior in production, identifying failure modes, performance bottlenecks, and reliability risks.
- Evolves and maintains shared AWS CDK and CDK8s constructs, with emphasis on observability, autoscaling, and operational safeguards rather than basic infrastructure provisioning.
- Maintains and operates core platform components such as VPC, EKS clusters, RDS, OpenSearch, and MSK, ensuring they expose meaningful operational signals.
- Operates and enhances Kubernetes cluster addons such as ingress controllers, cert-manager, and autoscalers, as well as the monitoring, logging, and tracing stacks.
- Defines and maintains SLIs, SLOs, and alerting strategies that clearly distinguish between symptoms, root causes, and actionable operational events.
- Improves automated operational responses, including autoscaling, self-healing mechanisms, and runbook-driven remediation.
- Ensures high reliability through structured alerting systems (Prometheus, CloudWatch), noise reduction, alert quality improvements, and recovery mechanisms.
- Collaborates with engineering teams to investigate production incidents, perform root cause analysis, and drive long-term reliability improvements.
- Owns CI/CD pipelines for Infrastructure as Code (IaC) and observability-related platform components.
- Applies Site Reliability Engineering (SRE) principles, including observability-first design, error budgets, and operational readiness, to shared platform services.
- Supports IAM roles, secrets management, and tenant isolation best practices.
Requirements
- Has 5+ years of experience in Site Reliability Engineering, Platform Engineering, or Infrastructure roles, with significant hands-on experience operating and supporting production systems.
- Demonstrates strong experience in observability operations, including defining metrics, logs, traces, dashboards, alerts, and reliability indicators for complex systems.
- Has hands-on experience with AWS services such as VPC, IAM, RDS, MSK, S3, and CloudWatch, as well as Kubernetes components like Helm, RBAC, and ServiceAccounts.
- Demonstrates fluency in Python and experience with Infrastructure-as-Code using AWS CDK, CDK8s, or equivalent frameworks.
- Possesses a strong understanding of Prometheus, Grafana, alert tuning, alert fatigue reduction, and incident-driven monitoring improvements.
- Has experience improving existing systems rather than building greenfield infrastructure, with a focus on operational excellence and system reliability.
- Shows a proven track record of using observability data to drive automation, scaling decisions, and operational improvements.
- Has experience designing reusable infrastructure or observability patterns, or contributing to internal developer or platform tooling.
- Has experience supporting Spark on Kubernetes, Argo, or Kafka-based batch pipelines (nice to have).
Benefits
- 100% Remote Work: Enjoy the freedom to work from the location that helps you thrive. All it takes is a laptop and a reliable internet connection.
- Highly Competitive USD Pay: Earn excellent, market-leading compensation in USD that goes beyond typical offerings.
- Paid Time Off: We value your well-being. Our paid time off policies ensure you have the chance to unwind and recharge when needed.
- Work with Autonomy: Enjoy the freedom to manage your time as long as the work gets done. Focus on results, not the clock.
- Work with Top American Companies: Grow your expertise working on innovative, high-impact projects with industry-leading U.S. companies.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
Site Reliability Engineering, Platform Engineering, Infrastructure roles, observability operations, Infrastructure-as-Code, Python, alert tuning, incident-driven monitoring, Kubernetes, AWS CDK
Soft skills
collaboration, root cause analysis, operational excellence, automation, scaling decisions, reliability improvements, communication