
Senior Site Reliability Engineer
Visa
full-time
Location Type: Remote
Location: Brazil
About the role
- Own the end‑to‑end lifecycle (design, provisioning, upgrades, maintenance, and decommissioning) of core platform components, including:
  - Cloud infrastructure primitives
  - Kubernetes clusters and cluster services
  - Networking, ingress, and service discovery
  - Service Mesh and supporting data‑plane components
- Design platform components to be resilient by default, applying SRE principles such as:
  - Fault isolation and graceful degradation
  - Capacity planning and saturation control
  - Reduced operational toil and clear failure modes
- Lead the design and implementation of infrastructure bootstrap orchestration, including:
  - Automated cluster and environment provisioning
  - Deterministic, repeatable platform bring‑up and teardown
  - Dependency‑aware orchestration across cloud, network, and Kubernetes layers
- Drive Infrastructure‑as‑Code and GitOps‑first practices to ensure that:
  - Platform components are reproducible and auditable
  - Changes are automated, testable, and reversible
  - Manual intervention is minimized or eliminated
- Identify automation gaps and lead initiatives that reduce human effort, onboarding time, and operational risk.
- Apply and promote SRE operational excellence practices, including:
  - Clear ownership and runbooks for platform components
  - Participation in on‑call rotation as a platform reliability escalation point
  - Incident response, post‑incident reviews, and problem management
- Improve day‑2 operations by standardizing upgrade/rollback strategies and reducing MTTD/MTTR.
- Ensure platform operations align with security, compliance, and internal control requirements.
- Collaborate with engineering teams across the organization to influence platform adoption, reliability standards, and cloud‑native best practices.
Requirements
- Proficiency in English at B2 level or above (Upper-Intermediate)
- Strong hands‑on experience with public cloud platforms (AWS preferred, Azure also considered)
- Proven experience operating and administering Kubernetes at scale in production environments
- Strong experience with container orchestration platforms and cloud architecture fundamentals (networking, IAM/security concepts, and reliability patterns)
- Experience with Infrastructure as Code (Terraform preferred) and automation‑first workflows
- Familiarity with GitOps practices and CI/CD pipelines
- Strong troubleshooting skills for distributed systems, including root‑cause analysis and reliability improvements
- Experience with observability concepts and practices (monitoring, logging, alerting, tracing)
- Experience with Service Mesh technologies (Istio preferred; App Mesh or Linkerd also considered)
- Experience working with critical or mission‑critical systems
- Strong background applying SRE principles (operational readiness, incident management, runbooks, toil reduction)
- AWS certifications
Benefits
- Remote work options
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Kubernetes, Infrastructure as Code, Terraform, GitOps, CI/CD, Service Mesh, Istio, AWS, Azure, observability
Soft Skills
troubleshooting, root-cause analysis, collaboration, leadership, incident management, operational readiness, problem management, capacity planning, communication, reliability improvements
Certifications
AWS certifications