Spearhead the Site Reliability Engineering function to ensure availability, scalability, and performance of core systems
Take responsibility for monitoring .NET applications deployed in AKS, EKS, App Services, and VMs
Design, implement, and maintain robust monitoring and alerting systems
Analyse system performance metrics, establish baselines, identify bottlenecks, and implement improvements for scalability and efficiency
Set up, configure, and optimise observability tools (Prometheus, Grafana, Datadog) to monitor metrics, logs, and traces
Ensure high availability and disaster recovery for critical systems; lead incident response and post-incident analysis
Develop and maintain SLOs, SLIs, and error budgets to meet reliability targets
Automate routine tasks and use infrastructure-as-code (Terraform, Ansible, Bicep) to manage cloud resources
Collaborate with DevOps/CloudOps and product development teams to build and deploy infrastructure via CI/CD (Azure DevOps, GitLab CI)
Mentor junior SREs and drive best practices across the engineering organisation
Identify areas for continuous improvement and stay up to date with industry trends, tools, and technologies
Requirements
5+ years of experience in Site Reliability Engineering, DevOps, or Systems Engineering, with a strong focus on monitoring, alerting, and incident management
Hands-on experience monitoring .NET applications in production (Grafana, Datadog, Azure Monitor)
Extensive experience with AKS, EKS, App Services, and VMs in cloud environments (AWS, Azure)
Strong proficiency in cloud platforms (AWS, Azure) and container orchestration (Kubernetes, AKS, EKS)
Proficiency in infrastructure-as-code tools (Terraform, Azure Resource Manager (ARM) templates, Bicep, Ansible)
Experience with monitoring and observability tools (Prometheus, Grafana, Datadog)