
Manager, Site Reliability Engineer – Platform
Visa
Full-time
Location Type: Remote
Location: Brazil
About the role
- Act as the technical owner of the Platform Squad, defining, driving, and enforcing platform standards for cloud infrastructure, Kubernetes, and Service Mesh across the full lifecycle (design, rollout, upgrades, and decommissioning)
- Ensure platform components are designed and operated according to SRE principles, focusing on reliability, scalability, and operational simplicity
- Drive architectural decisions with a sustainable platform vision, balancing innovation, security, and operational stability
- Define, build, and continuously improve operational processes for internal and external consumers, including platform onboarding and adoption; change management and release processes; and incident, problem, and escalation management
- Act as a point of escalation for complex platform incidents and reliability risks, participating in on-call rotations as needed
- Ensure platform operations comply with internal controls, audit requirements, and security standards
- Establish and own platform observability standards, ensuring consistent implementation of the four Golden Signals: latency, traffic, errors, and saturation
- Define and track platform SLIs, SLOs, and error budgets in partnership with internal consumers
- Use metrics and operational data to drive prioritization, reliability improvements, and capacity planning decisions
- Foster a collaborative, servant-leadership culture that enables squads to self-serve while maintaining guardrails
- Collaborate closely with application engineering teams, other SRE squads, and stakeholders across security, compliance, and architecture
- Promote knowledge sharing through strong documentation and enablement around platform usage and best practices
- Provide technical mentorship and guidance to platform engineers, supporting engineering excellence and growth
- Support the Squad Manager in planning, prioritization, and execution of platform initiatives
- Ensure work is visible, well-documented, and aligned with broader SRE and company objectives
Requirements
- 5+ years of relevant work experience with a Bachelor’s Degree
- Proven experience in Platform Engineering and/or SRE roles, with demonstrated technical leadership
- Strong hands-on experience with public cloud platforms (AWS preferred; Azure is a plus)
- Strong experience operating Kubernetes at scale (EKS or equivalent)
- Experience with Service Mesh technologies (Istio preferred; App Mesh, Linkerd, etc. are a plus)
- Solid understanding of SRE fundamentals, including SLIs/SLOs, error budgets, and reliability-driven prioritization
- Strong experience with observability tooling and practices, including metrics, logging, tracing, alerting, and Golden Signals
- Strong incident management and on-call operations experience, including escalation and problem management
- Experience with Infrastructure as Code (e.g., Terraform) and cloud-native operational patterns
- Strong understanding of cloud-native microservices architecture and platform enablement patterns
- Ability to translate complex technical concepts into clear guidance for non-platform teams
- Excellent collaboration, communication, and stakeholder management skills
Benefits
- Health insurance
- Flexible work arrangements
- Professional development
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Platform Engineering, Site Reliability Engineering (SRE), Kubernetes, Service Mesh, AWS, Azure, Infrastructure as Code, Terraform, Observability tooling, Cloud-native microservices architecture
Soft Skills
Technical leadership, Collaboration, Communication, Stakeholder management, Mentorship, Servant-leadership, Documentation, Problem management, Prioritization, Execution
Certifications
Bachelor’s Degree