Visa

Senior Site Reliability Engineer

Full-time

Location Type: Remote

Location: Brazil

About the role

  • Own the end‑to‑end lifecycle (design, provisioning, upgrades, maintenance, and decommissioning) of core platform components, including:
      • Cloud infrastructure primitives
      • Kubernetes clusters and cluster services
      • Networking, ingress, and service discovery
      • Service Mesh and supporting data‑plane components
  • Design platform components to be resilient by default, applying SRE principles such as:
      • Fault isolation and graceful degradation
      • Capacity planning and saturation control
      • Reduced operational toil and clear failure modes
  • Lead the design and implementation of infrastructure bootstrap orchestration, including:
      • Automated cluster and environment provisioning
      • Deterministic, repeatable platform bring‑up and teardown
      • Dependency‑aware orchestration across cloud, network, and Kubernetes layers
  • Drive Infrastructure‑as‑Code and GitOps‑first practices to ensure that:
      • Platform components are reproducible and auditable
      • Changes are automated, testable, and reversible
      • Manual intervention is minimized or eliminated
  • Identify automation gaps and lead initiatives that reduce human effort, onboarding time, and operational risk.
  • Apply and promote SRE operational excellence practices, including:
      • Clear ownership and runbooks for platform components
      • Participation in the on‑call rotation as a platform reliability escalation point
      • Incident response, post‑incident reviews, and problem management
  • Improve day‑2 operations by standardizing upgrade/rollback strategies and reducing MTTD/MTTR.
  • Ensure platform operations align with security, compliance, and internal control requirements.
  • Collaborate with engineering teams across the organization to influence platform adoption, reliability standards, and cloud‑native best practices.

Requirements

  • Proficiency in English at B2 level or above (Upper-Intermediate)
  • Strong hands‑on experience with public cloud platforms (AWS preferred, Azure also considered)
  • Proven experience operating and administering Kubernetes at scale in production environments
  • Strong experience with container orchestration platforms and cloud architecture fundamentals (networking, IAM/security concepts, and reliability patterns)
  • Experience with Infrastructure as Code (Terraform preferred) and automation‑first workflows
  • Familiarity with GitOps practices and CI/CD pipelines
  • Strong troubleshooting skills for distributed systems, including root‑cause analysis and reliability improvements
  • Experience with observability concepts and practices (monitoring, logging, alerting, tracing)
  • Experience with Service Mesh technologies (Istio preferred; App Mesh or Linkerd also considered)
  • Experience working with critical or mission‑critical systems
  • Strong background applying SRE principles (operational readiness, incident management, runbooks, toil reduction)
  • AWS certifications

Benefits

  • Remote work options