Tech Stack
Azure, Cloud, Grafana, ITSM, Kubernetes, Linux, Microservices, Prometheus, Python, SQL, Terraform
About the role
- Serve as a technical point of contact for clients, ensuring clear communication and positive relationships during issue resolution and service delivery
- Manage scalable infrastructure using Azure IaaS, PaaS and SaaS services, including AKS, Azure networking, and cost estimation/control
- Provision infrastructure via Terraform and Azure DevOps pipelines; manage secrets and DevOps tooling
- Lead and participate in incident response, root cause analysis, and post-mortem reporting
- Apply ITSM frameworks for incident, change, and problem management and maintain operational reporting
- Develop and maintain automated tools and scripts for infrastructure monitoring, alerting, and incident response using Azure Monitor, Log Analytics, Application Insights, Prometheus, and Grafana
- Provide technical mentorship to junior engineers and operational teams
- Provide Windows VM operational support including patching and backups
- Manage the reliability of mission-critical systems to ensure high availability and optimal performance
- Identify and implement process improvements to increase system reliability, shorten incident response times, and improve operational efficiency
- Maintain comprehensive documentation, including system architecture, configurations, and troubleshooting guides
Requirements
- 3-5 years of relevant, hands-on experience with Azure cloud technologies
- Strong expertise in Azure Kubernetes Service (AKS) and other Azure services including PaaS, SaaS, and IaaS (Azure App Service, Azure SQL Database, Azure Storage, Azure Functions, Azure Active Directory)
- Proficient with Terraform for infrastructure as code (IaC)
- Experience with Azure monitoring tools like Azure Monitor, Azure Log Analytics, and Application Insights
- Solid understanding of cloud infrastructure, container orchestration, and microservices architectures
- Experience in managing incidents and applying ITSM principles (incident management, change management, problem management)
- Hands-on experience with monitoring tools like Prometheus, Grafana, or similar
- Scripting and automation skills in Python, Bash, or PowerShell
- Basic knowledge of supporting Azure Windows and Linux VMs, including patching, backup, and basic troubleshooting
- Customer-facing experience with strong communication skills
- Problem-solving skills and ability to stay calm under pressure
- Prior experience in coaching and mentoring junior engineers
- Flexibility to adapt to changing priorities and work in a fast-paced environment
- Graduate degree in a technology-related field