Cloud Reliability Engineer

Infios

full-time

Posted on: 12/27/2025

Location Type: Remote

Location: Brazil

About the role

Operate, maintain, and improve cloud infrastructure in AWS, Azure, or GCP environments.
Manage and optimize Kubernetes clusters, including deployment, scaling, patching, and upgrades.
Ensure system availability, scalability, and performance through proactive monitoring and optimization.
Maintain infrastructure-as-code (IaC) for consistent and repeatable deployments.
Identify opportunities for operational automation to eliminate manual processes (“reduce toil”).
Build and maintain automated pipelines for deployments, configuration, and remediation.
Develop self-healing mechanisms to automatically detect and resolve common service issues.
Design proactive monitoring, alerting, and observability dashboards (Dynatrace, DataDog).
Collaborate with DevOps and development teams to build reliable, observable, and resilient systems.
Monitor, troubleshoot, and resolve infrastructure and application issues.

Requirements

Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
5+ years of experience in Cloud Engineering, DevOps, or Site Reliability roles.
Hands-on experience with cloud platforms (OCI, AWS, Azure, or GCP).
Strong knowledge of Kubernetes deployment, management, and troubleshooting.
Solid understanding of observability and monitoring tools (e.g., Dynatrace, DataDog) and incident management platforms.
Proficiency in scripting and automation (e.g., Python, Bash, Terraform, Ansible).
Strong troubleshooting and analytical skills across infrastructure and applications.
Experience with incident response, RCA, and postmortem processes.
A mindset of continuous improvement, reliability, and self-healing automation.
Understanding of SRE principles, SLAs/SLOs/SLIs, and chaos engineering practices.

Benefits

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

cloud infrastructure, Kubernetes, infrastructure-as-code, automation, scripting, Python, Bash, Terraform, Ansible, observability

Soft Skills

troubleshooting, analytical skills, continuous improvement, collaboration, reliability, self-healing automation