Apply

Ready to go for it?

AI Apply speeds things up—apply directly if you prefer.

FREE ACCESS
5,000–10,000 jobs/day
JobTailor Logo

See all jobs on JobTailor

Search thousands of fresh jobs every day.

Discover
  • Fresh listings
  • Fast filters
  • No subscription required
Create a free account and start exploring right away.
Embrace Software Inc

Azure CloudOps Engineer

Embrace Software Inc

CloudOps Engineer managing Azure infrastructure across multiple environments for AI-driven software solutions. Collaborating with engineering teams to ensure security, reliability, and scalability of platforms.

Posted 5/29/2026full-timeRemote • 🇺🇸 United StatesSeniorLeadWebsite

Tech Stack

Tools & technologies
AzureCloudDNSFirewallsPostgresPythonRTOSTerraformVault

About the role

Key responsibilities & impact
  • Manage and support Azure infrastructure across dev, QA, staging, and production
  • Maintain operational health of Static Web Apps, Container Apps, PostgreSQL, Storage Accounts, SignalR, Service Bus, Azure AI Foundry, Azure Arc, and related services
  • Ensure resources are provisioned, configured, monitored, maintained, and retired per company standards
  • Support environment setup for new products, customers, and integrations
  • Identify and resolve infrastructure issues affecting performance, reliability, availability, or security
  • Build and maintain Terraform modules and environment configurations
  • Ensure infrastructure changes are version-controlled, peer-reviewed, tested, and approved
  • Manage Terraform state, workspaces, variables, secrets, and deployment workflows
  • Detect and resolve drift between Terraform and deployed Azure resources
  • Standardize naming, tagging, resource group structure, environment isolation, and module patterns
  • Build, maintain, and troubleshoot GitHub Actions workflows for application and infrastructure deployments
  • Support CI/CD pipelines across multiple SaaS products and environments
  • Implement promotion flows from dev to QA to staging to production
  • Add deployment safeguards: environment protection rules, approvals, rollback procedures, validation checks, release gates, and audit trails
  • Manage pipeline secrets, service principals, managed identities, and deployment credentials
  • Improve build and deployment reliability, speed, and traceability
  • Operate and monitor Azure AI services, including Azure AI Foundry and Speech-to-Text workloads
  • Support production operations for LLM integrations and AI-enabled product features
  • Monitor AI service availability, latency, quota usage, token consumption, API failures, throttling, and cost
  • Help define operational standards for AI workloads: access control, logging, alerting, failover, usage governance, and provider disruption handling
  • Partner with engineering to troubleshoot AI service issues, integration failures, degraded model responses, or provider-side disruptions
  • Support secure handling of AI secrets, endpoints, keys, managed identities, and private network access
  • Implement and maintain monitoring with Azure Monitor, Log Analytics, Application Insights, and related tools
  • Build dashboards for infrastructure, application, database, messaging, storage, AI service, and deployment health
  • Configure alerts for availability, latency, errors, resource saturation, queue depth, failed jobs, failed deployments, database health, quota exhaustion, and cost anomalies
  • Improve signal quality by reducing noise and ensuring alerts are actionable
  • Participate in production incident response for infrastructure, deployments, integrations, and platform services
  • Triage and resolve issues across Azure services, CI/CD, Terraform, networking, databases, messaging, and AI integrations
  • Create and maintain runbooks for common operational issues
  • Support root cause analysis and post-incident reviews
  • Implement preventive actions after incidents to improve reliability
  • Help define severity levels, escalation paths, response expectations, on-call processes, and production support procedures
  • Implement cloud security best practices across Azure environments
  • Manage Azure RBAC, managed identities, service principals, Key Vault access, and least-privilege permissions
  • Secure GitHub Actions workflows, deployment credentials, environment secrets, and production access
  • Support secret rotation, certificate management, and secure configuration management
  • Enforce network security via private endpoints, firewalls, IP restrictions, and environment-specific access rules
  • Support audit and compliance readiness for SOC 2, ISO 27001, or similar frameworks
  • Support Azure PostgreSQL operations: backups, restores, performance monitoring, connection limits, HA, and capacity planning
  • Monitor and maintain Azure Storage Accounts, lifecycle policies, access controls, backup strategy, and usage trends
  • Support Azure Service Bus operations: queue/topic monitoring, dead-letter handling, retry behavior, and throughput
  • Support SignalR operational health, connection metrics, scaling behavior, and related production issues
  • Monitor Azure spend across products, environments, services, and customers where applicable
  • Implement tagging standards to support cost allocation by product, environment, customer, or business unit
  • Build cost dashboards, budget alerts, anomaly detection, and recurring cost reviews
  • Identify underutilized resources and recommend right-sizing opportunities
  • Review AI service costs, LLM and token usage, STT usage, storage growth, database sizing, and environment costs
  • Recommend savings plans, reservations, scaling rules, lifecycle policies, or shutdown schedules
  • Define and maintain backup and recovery procedures for critical cloud services
  • Test database restores and validate backup reliability
  • Help define RTOs and RPOs for production systems
  • Support disaster recovery planning for SaaS products and customer-facing services
  • Improve resilience through scaling rules, failover patterns, health checks, synthetic monitoring, and production readiness reviews
  • Create and maintain CloudOps documentation, runbooks, deployment guides, and environment standards
  • Define standards for naming, tagging, logging, alerting, access control, Terraform structure, GitHub Actions patterns, and production changes
  • Document procedures for cloud services, CI/CD workflows, AI services, and incident response
  • Enable engineering teams with reusable patterns, templates, and self-service guidance

Requirements

What you’ll need
  • 7+ years of hands-on experience operating production workloads in Microsoft Azure
  • Strong experience with Terraform and infrastructure as code
  • Experience building and maintaining CI/CD pipelines using GitHub Actions
  • Experience with containerized workloads, preferably Azure Container Apps or similar
  • Experience with Azure Monitor, Log Analytics, and Application Insights
  • Experience with Azure PostgreSQL or similar managed relational databases
  • Strong understanding of Azure networking, DNS, identity, RBAC, managed identities, Key Vault, and security best practices
  • Experience troubleshooting production incidents across infrastructure, deployments, networking, and cloud services
  • Comfortable scripting in Bash, PowerShell, Python, or similar
  • Strong documentation, communication, and cross-functional collaboration skills

Benefits

Comp & perks
  • Competitive salary commensurate with experience
  • Opportunities for career advancement and professional development
  • Experience collaborating with a diverse, global team within a remote work setting

ATS Keywords

✓ Tailor your resume
Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
AzureTerraformCI/CDGitHub ActionsPostgreSQLBashPowerShellPythonAzure MonitorLog Analytics
Soft Skills
documentationcommunicationcross-functional collaboration