
Senior Site Reliability Engineer
Niche
full-time
Location Type: Remote
Location: Argentina
About the role
- Own and architect cloud infrastructure across AWS and GCP, including EC2, EKS/Kubernetes, RDS, ElastiCache, S3, and networking components (VPCs, load balancers, DNS), driving improvements that increase reliability and reduce operational burden
- Lead the design and implementation of secrets management strategies using HashiCorp Vault and other tools, establishing organizational standards for secure configuration management
- Architect and evolve infrastructure-as-code practices using Terraform, driving adoption of patterns that improve consistency and reduce deployment risk
- Design and optimize deployment pipelines and CI/CD systems, troubleshoot complex deployment failures with Git and FluxCD, and establish best practices for safe, reliable releases
- Support database operations including migrations and performance tuning
- Own Kafka clusters and message queue systems, including architecture decisions, capacity planning, and troubleshooting complex processing issues
- Participate in 24/7 on-call rotations, responding to alerts, triaging incidents, and coordinating with development teams to resolve production issues
- Design and implement monitoring, alerting, and observability strategies using Prometheus, Grafana, Sumo Logic, and related tools, establishing organizational standards that catch issues before customers notice them
- Define and own Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for critical services, balancing business needs with engineering resources
- Lead blameless post-mortems, write comprehensive incident analyses that teach others, and drive systemic improvements that prevent entire classes of incidents
- Champion access controls, IAM policies, and security configurations across cloud environments, ensuring infrastructure meets compliance and security requirements
- Identify and eliminate systemic sources of operational toil by designing automation, building self-service tooling, and improving developer workflows that scale the team's impact
- Lead AI-assisted automation initiatives to streamline SRE processes, implementing solutions that reduce toil and improve incident response
- Partner with product development teams as the reliability subject matter expert, providing architecture guidance, production readiness reviews, and proactive capacity planning
- Mentor and coach SRE team members, helping them develop technical skills and operational judgment through pairing, code review, and incident response shadowing
- Lead knowledge sharing initiatives, demos, and cross-team collaboration to elevate reliability culture and operational excellence across the engineering organization
Requirements
- 5+ years of experience with cloud platforms (AWS or GCP) and container orchestration systems (Kubernetes/Docker)
- Experience with cloud networking concepts and services including VPCs, subnets, security groups, NAT gateways, VPC peering, load balancers, and DNS management (Route 53, Cloud DNS)
- Strong programming skills in one or more languages (Python, Go, Bash) with demonstrated ability to build automation and tooling
- Advanced experience with Infrastructure as Code tools (Terraform, Helm, Ansible) including module design and organizational standards
- Deep understanding of Linux systems administration and networking fundamentals (TCP/IP, DNS, load balancing, distributed systems)
- Experience with SQL databases (PostgreSQL, MySQL, or SQL Server) including performance tuning and capacity planning
- Experience designing and operating CI/CD pipelines for reliable software delivery
- Track record of leading incident response and driving complex issues to resolution
- Demonstrated ability to mentor engineers and contribute to team technical growth
- Excellent collaboration and communication skills, with the ability to influence technical decisions across teams
Benefits
- All interviews are being held remotely
- If there are preparations we can make to help ensure you have a comfortable and positive interview experience, please let us know.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
AWS, GCP, Kubernetes, Terraform, CI/CD, Python, Go, Bash, SQL, Linux
Soft Skills
mentoring, collaboration, communication, incident response, leadership, problem-solving, influencing, coaching, knowledge sharing, organizational standards