FREE ACCESS
5,000–10,000 jobs/day

See all jobs on JobTailor
Search thousands of fresh jobs every day.
Discover
- Fresh listings
- Fast filters
- No subscription required
Create a free account and start exploring right away.
Tech Stack
Tools & technologiesAzureCloudDistributed SystemsDockerGrafanaKubernetesPrometheusTerraform
About the role
Key responsibilities & impact- Design, implement, and operate highly available, scalable, and fault-tolerant systems on Microsoft Azure.
- Define, track, and improve reliability metrics, including service health indicators and operational performance reporting.
- Develop and maintain Infrastructure as Code using Bicep, ARM, Terraform, or similar tooling to ensure consistent and reproducible environments.
- Build automation for provisioning, deployment, scaling, and operational workflows, reducing manual intervention and operational toil.
- Enhance CI/CD pipelines in collaboration with DevOps to improve deployment safety, reliability, and efficiency.
- Implement and maintain monitoring, logging, tracing, and alerting solutions to ensure real-time visibility and rapid issue detection.
- Define meaningful alerting strategies that reduce noise and improve response effectiveness.
- Lead incident response activities, including structured troubleshooting, stakeholder communication, root cause analysis, and post-incident reviews.
- Strengthen incident and problem management processes to improve SLA adherence and customer impact mitigation.
- Implement systemic improvements to prevent repeat incidents rather than applying short-term workarounds.
- Embed security and compliance best practices across infrastructure, including access control, encryption, and policy enforcement.
- Drive continuous service improvement initiatives to enhance performance, reliability, efficiency, and operational maturity.
- Collaborate closely with Development and QA teams to improve application resilience and supportability.
Requirements
What you’ll need- Proven experience as a Site Reliability Engineer or in a similar reliability-focused role within a SaaS or cloud-native environment.
- Strong hands-on experience operating production workloads on Microsoft Azure across compute, networking, storage, and monitoring services.
- Infrastructure as Code expertise using Bicep, ARM, Terraform, or similar tools.
- Strong automation and scripting capability (PowerShell essential; additional scripting languages advantageous).
- Experience working with containerised environments (Docker) and orchestration concepts such as Kubernetes.
- Practical experience with observability tooling such as Azure Monitor, Grafana, Prometheus, Datadog, or OpenTelemetry.
- Strong understanding of structured incident response, root cause analysis, SLA/SLO concepts, and reliability engineering practices.
- Knowledge of security best practices and compliance standards such as ISO27001, SOC 2, and GDPR.
- Strong problem-solving capability with the ability to troubleshoot complex, distributed systems.
- Effective communication skills and ability to collaborate across engineering, operations, and business stakeholders.
- Azure certifications (e.g., Azure Administrator Associate, Azure Solutions Architect Expert) are desirable.
Benefits
Comp & perks- 25 days Annual Leave
- Private pension
- Bonus scheme
- Private health
- Life assurance
ATS Keywords
✓ Tailor your resumeApplicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Site Reliability EngineeringProduction Workloads ManagementContainerization (Docker)Kubernetes OrchestrationReliability Engineering Practices
Soft Skills
Effective CommunicationCollaboration Across TeamsProblem-Solving Capability
Certifications
Azure Administrator AssociateAzure Solutions Architect Expert
