webAI™

Staff DevOps Engineer

Full-time

Location Type: Hybrid

Location: Austin, Texas, United States

About the role

  • Design and architect secure, scalable cloud and edge infrastructure for deploying AI workloads across multi-cloud (AWS, Azure, GCP) and hybrid environments
  • Build and maintain production-grade Infrastructure as Code (IaC) using Terraform, Ansible, or Pulumi, managing 100+ resources with GitOps workflows and automated validation
  • Design and operate production Kubernetes clusters optimized for AI/ML workloads with GPU support, implementing container security, multi-tenancy, and resource optimization
  • Implement secure CI/CD pipelines with integrated security controls (SAST, DAST, vulnerability scanning, secrets management) and automated deployment workflows for containerized AI models
  • Lead MLOps infrastructure initiatives including model deployment pipelines, versioning, feature stores, experiment tracking, and monitoring for model performance and drift
  • Design comprehensive observability and monitoring using Prometheus, Grafana, ELK, or Datadog with distributed tracing, APM, and real-time alerting aligned to SLIs/SLOs
  • Implement security best practices including least-privilege access, encryption at rest/in transit, network segmentation, and automated compliance validation
  • Lead incident response and reliability initiatives, participate in on-call rotation, conduct post-mortems, and drive continuous improvement for system reliability
  • Architect disaster recovery and business continuity strategies with automated backup, failover, and recovery processes
  • Develop reusable infrastructure modules and templates to accelerate environment provisioning and standardize deployment patterns across teams
  • Mentor mid-level and senior engineers on cloud architecture, DevOps best practices, and platform reliability through design reviews and technical guidance
  • Drive technical documentation and knowledge sharing including runbooks, architecture decision records (ADRs), and infrastructure standards

Requirements

  • 7+ years of hands-on experience in DevOps, Site Reliability Engineering, or Infrastructure Engineering, with a proven track record of architecting production systems
  • Expert-level proficiency with Docker, Kubernetes (CKA/CKAD preferred), and cloud-native technologies in production environments
  • 5+ years implementing Infrastructure as Code with Terraform, Ansible, or Pulumi, managing large-scale (50+) cloud resources
  • Deep experience with cloud platforms (AWS, Azure, or GCP) including compute, networking, storage, and managed services
  • Proven experience building and scaling CI/CD pipelines with integrated security controls (GitHub Actions, GitLab CI, Jenkins, ArgoCD)
  • Strong programming skills in Python (preferred for automation), Bash, or Go for infrastructure tooling and automation
  • Production experience with observability and monitoring tools: Prometheus, Grafana, ELK, CloudWatch, Datadog, or similar
  • Experience with MLOps workflows: model deployment automation, versioning, and lifecycle management
  • Demonstrated experience with GitOps methodologies and declarative infrastructure management
  • Strong understanding of security best practices: encryption, secrets management, identity and access management (IAM), network security
  • Excellent written and verbal communication skills for technical documentation and cross-functional collaboration

Benefits

  • Competitive salary and performance-based incentives
  • Comprehensive health, dental, and vision benefits package
  • 401(k) match (US-based only)
  • $200/month Health and Wellness Stipend
  • $400/year Continuing Education Credit
  • $500/year Function Health subscription (US-based only)
  • Free parking for in-office employees
  • Unlimited Approved PTO
  • Parental Leave for Eligible Employees
  • Supplemental Life Insurance

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Infrastructure as Code, Terraform, Ansible, Pulumi, Kubernetes, Docker, CI/CD, Python, Bash, Go

Soft Skills
leadership, mentoring, communication, collaboration, incident response, continuous improvement, technical documentation

Certifications
CKA, CKAD