Tech Stack
Ansible, AWS, Azure, Cloud, Google Cloud Platform, Grafana, Kubernetes, Prometheus, Splunk, Terraform
About the role
- Achieve measurable improvements in system uptime and performance by implementing robust reliability engineering practices and leading incident prevention initiatives.
- Reduce Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) through streamlined incident response protocols and team readiness.
- Build, lead, and develop a skilled team of Customer Reliability Engineers with a strong focus on ownership, collaboration, and continuous learning.
- Ensure that reliability is embedded into service design, development, deployment, and operations by partnering with engineering, product, and operations teams.
- Deliver clear and actionable reporting on reliability metrics to support leadership decision-making and continuous improvement.
- Align reliability goals with customer expectations by addressing root causes of service degradation and championing seamless user experiences.
- Identify and address potential reliability risks before they impact customers by implementing observability tools, runbooks, and automated responses.
- Drive reliability improvements that reduce operational costs by eliminating manual processes, optimizing resource usage, and reducing reactive work.
- Oversee timely incident response, root cause analysis, and implementation of long-term fixes to prevent recurring issues and improve service resilience.
- Work closely with software engineering, DevOps, product, and support teams to embed reliability into the end-to-end service lifecycle.
- Ensure effective monitoring systems, dashboards, and alerts are in place to detect, respond to, and analyze system performance and failures.
- Define and drive the implementation of a reliability roadmap aligned with business objectives, system scalability, and customer needs.
- Translate system performance into customer impact metrics (e.g., NPS, downtime minutes) and continuously enhance the end-user experience.
- Track and report on key reliability metrics such as uptime, latency, error rates, and incident frequency to support transparency and data-driven decisions.
- Proactively identify technical and operational risks, ensuring mitigation strategies are in place and aligned with compliance standards.
- Foster a culture of experimentation and improvement by exploring automation, new tools, and process enhancements to strengthen reliability practices.
Requirements
- Bachelor's Degree in Computer Science, Software Engineering, Information Technology, or a related technical discipline.
- Certifications in relevant areas such as Site Reliability Engineering (SRE), DevOps, ITIL, or Cloud Infrastructure (e.g., AWS, Azure, GCP) are highly desirable.
- A Master's Degree in Technology Management, Engineering, or Business Administration is an added advantage.
- 7–10 years of experience in IT operations, systems engineering, or reliability engineering within a technology-driven environment.
- At least 3–5 years in a leadership or managerial role, with proven experience leading reliability or DevOps teams.
- Hands-on experience implementing and managing observability platforms, monitoring tools (e.g., Prometheus, Grafana, Splunk), and automation frameworks.
- Demonstrated ability to lead incident response efforts, conduct root cause analysis, and implement sustainable, long-term service reliability improvements.
- Experience working in agile environments and with cross-functional teams, including software development, infrastructure, product, and support.
- Strong understanding of cloud-native technologies, container orchestration (e.g., Kubernetes), CI/CD pipelines, and infrastructure as code (e.g., Terraform, Ansible).