Salary
💰 $120,000 - $130,000 per year
Tech Stack
Ansible, AWS, Cloud, Distributed Systems, Jenkins, Linux, Python, Swift, Switching, Terraform
About the role
- Oversee the design, implementation, and maintenance of reliable infrastructure and services.
- Collaborate with other teams to define requirements, standards, and best practices.
- Identify and address performance bottlenecks and ensure system stability.
- Implement and improve monitoring and observability frameworks.
- Manage on-call rotations and incident response to minimize downtime and ensure swift resolution.
- Drive automation efforts to reduce manual tasks and improve efficiency.
- Implement structured engineering and operations processes.
- Analyze and evaluate existing processes to identify opportunities for improvement.
- Develop and implement the long-term reliability strategy for the organization.
- Make build-vs-buy decisions for tools and technologies.
- Ensure alignment with business goals and customer expectations.
- Manage relationships with vendors and other stakeholders.
- Act as a bridge between technical teams and other departments.
- Represent the SRE team to stakeholders and communicate effectively.
- Collaborate with other engineering teams to ensure efficient workflows.
- Foster a culture of blameless postmortems and continuous learning.
Requirements
- Strong technical background in distributed systems, cloud computing, and related technologies.
- Proven experience in managing and mentoring technical teams.
- Excellent problem-solving and communication skills.
- Experience with monitoring, automation, and incident management.
- Understanding of SLOs, SLIs, and SLAs.
- Familiarity with DevOps and Agile practices.