
Senior Azure Site Reliability Engineer
Manila Recruitment
full-time
Location Type: Remote
Location: Philippines
About the role
- You will be responsible for provisioning and managing cloud infrastructure on the Azure public cloud to support organizational needs.
- The SRE is responsible for ensuring the reliability, availability, and performance of cloud-based infrastructure and applications deployed on Microsoft Azure.
- This role involves automating operations, monitoring system health, optimizing performance, and troubleshooting complex issues to maintain a highly available and secure cloud environment.
- The SRE will work closely with development, security, and IT operations teams to enhance cloud solutions, implement best practices, and support scalable and resilient systems.
- Deploy and manage Azure cloud services including Virtual Machines, Storage, Redis, Azure SQL databases, virtual networks, and Azure Kubernetes Service (AKS) clusters.
- Automate provisioning, configuration, and deployments using PowerShell, Bash, and Ansible.
- Deliver and deploy Azure infrastructure using Infrastructure as Code (IaC), specifically Azure Bicep.
- Review, configure, and implement monitoring to give Level 1 support teams the best possible visibility and transparency.
- Implement and troubleshoot CI/CD pipelines for application deployments in Azure DevOps, TeamCity, and Octopus Deploy.
- Maintain system reliability using Azure Monitor, Application Insights, Log Analytics, Prometheus/Grafana, Splunk, Opsgenie, and Slack.
- Optimize performance and cost efficiency of Azure resources.
- Train junior members of the team to deliver best-of-breed solutions on the Azure public cloud.
- Review, manage, and troubleshoot Azure Kubernetes Service (AKS) clusters.
- Review and manage cloud and on-premises servers, including AKS, covering OS and RabbitMQ upgrades, security patches, and application service support.
- Respond to system alerts, failures, and security incidents.
- Perform root cause analysis (RCA) and implement preventive measures.
- Provide Level 2 support in an on-call capacity based on a pre-approved schedule (including weekends).
- Review the network and security design for all infrastructure and applications hosted in Azure.
- Continuously promote better ways to deliver infrastructure solutions on the Azure cloud.
- Propose adoption of new approaches, patterns, techniques, and ideas recommended by industry standards and trends.
- Work closely with software development and network teams to enhance platform reliability and identify better approaches.
- Administer and optimize Linux-based systems used for application hosting, ensuring stability, security, and performance in production and non-production environments.
- Troubleshoot issues in Linux operating systems, services, and middleware components to support application availability.
Requirements
- At least 3 years of proven experience in delivering infrastructure solutions on Azure cloud.
- 5+ years of hands-on experience with infrastructure design and deployment utilizing PaaS, SaaS and IaaS cloud offerings.
- At least 2 years of experience with Windows Server.
- Experience with either Azure ARM templates or Azure Bicep.
- At least 3 years of experience in Linux administration, including managing Linux-based operating systems and applications.
- At least 2 years of hands-on experience designing, building, and deploying containerized runtime environments based on Azure Kubernetes Service (AKS).
- 1+ years of proven experience administering RabbitMQ clusters and Nginx.
- Proven experience with scripting languages such as PowerShell, Python, JavaScript, and Bash.
- Experience using Splunk, Grafana, and Opsgenie is an asset.
Advantageous skills
- Relevant certifications
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Azure cloud, Infrastructure as Code (IaC), PowerShell, Bash, Ansible, Azure Kubernetes Service (AKS), Linux Administration, RabbitMQ, Nginx, CI/CD pipelines
Soft Skills
troubleshooting, automation, monitoring, performance optimization, team collaboration, training, root cause analysis, problem-solving, communication, best practices implementation