Azure CloudOps Engineer
Embrace Software Inc
CloudOps Engineer managing Azure infrastructure across multiple environments for AI-driven software solutions. Collaborating with engineering teams to ensure security, reliability, and scalability of platforms.
Tech Stack
Tools & technologies: Azure, Cloud, DNS, Firewalls, Postgres, Python, RTOS, Terraform, Vault
About the role
Key responsibilities & impact
- Manage and support Azure infrastructure across dev, QA, staging, and production
- Maintain operational health of Static Web Apps, Container Apps, PostgreSQL, Storage Accounts, SignalR, Service Bus, Azure AI Foundry, Azure Arc, and related services
- Ensure resources are provisioned, configured, monitored, maintained, and retired per company standards
- Support environment setup for new products, customers, and integrations
- Identify and resolve infrastructure issues affecting performance, reliability, availability, or security
- Build and maintain Terraform modules and environment configurations
- Ensure infrastructure changes are version-controlled, peer-reviewed, tested, and approved
- Manage Terraform state, workspaces, variables, secrets, and deployment workflows
- Detect and resolve drift between Terraform and deployed Azure resources
- Standardize naming, tagging, resource group structure, environment isolation, and module patterns
- Build, maintain, and troubleshoot GitHub Actions workflows for application and infrastructure deployments
- Support CI/CD pipelines across multiple SaaS products and environments
- Implement promotion flows from dev to QA to staging to production
- Add deployment safeguards: environment protection rules, approvals, rollback procedures, validation checks, release gates, and audit trails
- Manage pipeline secrets, service principals, managed identities, and deployment credentials
- Improve build and deployment reliability, speed, and traceability
- Operate and monitor Azure AI services, including Azure AI Foundry and Speech-to-Text workloads
- Support production operations for LLM integrations and AI-enabled product features
- Monitor AI service availability, latency, quota usage, token consumption, API failures, throttling, and cost
- Help define operational standards for AI workloads: access control, logging, alerting, failover, usage governance, and provider disruption handling
- Partner with engineering to troubleshoot AI service issues, integration failures, degraded model responses, or provider-side disruptions
- Support secure handling of AI secrets, endpoints, keys, managed identities, and private network access
- Implement and maintain monitoring with Azure Monitor, Log Analytics, Application Insights, and related tools
- Build dashboards for infrastructure, application, database, messaging, storage, AI service, and deployment health
- Configure alerts for availability, latency, errors, resource saturation, queue depth, failed jobs, failed deployments, database health, quota exhaustion, and cost anomalies
- Improve signal quality by reducing noise and ensuring alerts are actionable
- Participate in production incident response for infrastructure, deployments, integrations, and platform services
- Triage and resolve issues across Azure services, CI/CD, Terraform, networking, databases, messaging, and AI integrations
- Create and maintain runbooks for common operational issues
- Support root cause analysis and post-incident reviews
- Implement preventive actions after incidents to improve reliability
- Help define severity levels, escalation paths, response expectations, on-call processes, and production support procedures
- Implement cloud security best practices across Azure environments
- Manage Azure RBAC, managed identities, service principals, Key Vault access, and least-privilege permissions
- Secure GitHub Actions workflows, deployment credentials, environment secrets, and production access
- Support secret rotation, certificate management, and secure configuration management
- Enforce network security via private endpoints, firewalls, IP restrictions, and environment-specific access rules
- Support audit and compliance readiness for SOC 2, ISO 27001, or similar frameworks
- Support Azure PostgreSQL operations: backups, restores, performance monitoring, connection limits, HA, and capacity planning
- Monitor and maintain Azure Storage Accounts, lifecycle policies, access controls, backup strategy, and usage trends
- Support Azure Service Bus operations: queue/topic monitoring, dead-letter handling, retry behavior, and throughput
- Support SignalR operational health, connection metrics, scaling behavior, and related production issues
- Monitor Azure spend across products, environments, services, and customers where applicable
- Implement tagging standards to support cost allocation by product, environment, customer, or business unit
- Build cost dashboards, budget alerts, anomaly detection, and recurring cost reviews
- Identify underutilized resources and recommend right-sizing opportunities
- Review AI service costs, LLM and token usage, STT usage, storage growth, database sizing, and environment costs
- Recommend savings plans, reservations, scaling rules, lifecycle policies, or shutdown schedules
- Define and maintain backup and recovery procedures for critical cloud services
- Test database restores and validate backup reliability
- Help define RTOs and RPOs for production systems
- Support disaster recovery planning for SaaS products and customer-facing services
- Improve resilience through scaling rules, failover patterns, health checks, synthetic monitoring, and production readiness reviews
- Create and maintain CloudOps documentation, runbooks, deployment guides, and environment standards
- Define standards for naming, tagging, logging, alerting, access control, Terraform structure, GitHub Actions patterns, and production changes
- Document procedures for cloud services, CI/CD workflows, AI services, and incident response
- Enable engineering teams with reusable patterns, templates, and self-service guidance
Requirements
What you’ll need
- 7+ years of hands-on experience operating production workloads in Microsoft Azure
- Strong experience with Terraform and infrastructure as code
- Experience building and maintaining CI/CD pipelines using GitHub Actions
- Experience with containerized workloads, preferably Azure Container Apps or similar
- Experience with Azure Monitor, Log Analytics, and Application Insights
- Experience with Azure PostgreSQL or similar managed relational databases
- Strong understanding of Azure networking, DNS, identity, RBAC, managed identities, Key Vault, and security best practices
- Experience troubleshooting production incidents across infrastructure, deployments, networking, and cloud services
- Comfortable scripting in Bash, PowerShell, Python, or similar
- Strong documentation, communication, and cross-functional collaboration skills
Benefits
Comp & perks
- Competitive salary commensurate with experience
- Opportunities for career advancement and professional development
- Experience collaborating with a diverse, global team within a remote work setting
Hard Skills & Tools
Azure, Terraform, CI/CD, GitHub Actions, PostgreSQL, Bash, PowerShell, Python, Azure Monitor, Log Analytics
Soft Skills
documentation, communication, cross-functional collaboration