Tech Stack
Ansible, AWS, Azure, Cloud, Distributed Systems, Go, Google Cloud Platform, Java, Linux, Python, Ruby, Terraform, Unix
About the role
- Lead, mentor, and grow a team of high-performing Site Reliability Engineers
- Foster a collaborative and supportive team environment, promoting continuous learning and professional development
- Conduct regular one-on-one meetings and performance reviews, and provide constructive feedback
- Recruit, onboard, and retain top SRE talent
- Drive the implementation and continuous improvement of SRE principles and practices, including SLIs, SLOs, error budgets, and post-mortems
- Oversee the design, development, and maintenance of robust, scalable, and highly available infrastructure and applications
- Implement and optimize monitoring, alerting, and logging solutions to ensure proactive identification and resolution of issues
- Participate in incident response, root cause analysis, and problem resolution processes, ensuring effective communication and remediation
- Maintain and operate disaster recovery plans and strategies
- Reduce operational toil through automation and tooling development
- Participate in an on-call rotation every 4 weeks
- Collaborate closely with development, product, and other engineering teams to integrate reliability into the entire software development lifecycle
- Influence architectural decisions to ensure systems are built with reliability, scalability, and maintainability in mind
- Define and track key reliability metrics and report on the overall health and performance of our systems
- Contribute to the long-term technical vision and strategy for the SRE function
Requirements
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience
- 5+ years of experience in Site Reliability Engineering, DevOps, or a similar role
- 5+ years of experience leading or managing SRE or infrastructure teams
- Experience with large-scale distributed systems and cloud platforms (e.g., AWS, Azure, GCP)
- Experience with SRE principles, methodologies, and tools
- Experience in at least one scripting or programming language (e.g., Python, Go, Java, Ruby)
- Experience with infrastructure as code (e.g., Terraform, CloudFormation, Ansible)
- Experience with networking, operating systems (Linux/Unix), and database technologies
Benefits
- Competitive salary
- Flexible working hours
- Professional development budget
- Home office setup allowance
- Global team events