
Staff DevOps Engineer
webAI™
full-time
Location Type: Hybrid
Location: Austin • Texas • United States
About the role
- Design and architect secure, scalable cloud and edge infrastructure for deploying AI workloads across multi-cloud (AWS, Azure, GCP) and hybrid environments
- Build and maintain production-grade Infrastructure as Code (IaC) using Terraform, Ansible, or Pulumi, managing 100+ resources with GitOps workflows and automated validation
- Design and operate production Kubernetes clusters optimized for AI/ML workloads with GPU support, implementing container security, multi-tenancy, and resource optimization
- Implement secure CI/CD pipelines with integrated security controls (SAST, DAST, vulnerability scanning, secrets management) and automated deployment workflows for containerized AI models
- Lead MLOps infrastructure initiatives including model deployment pipelines, versioning, feature stores, experiment tracking, and monitoring for model performance and drift
- Design comprehensive observability and monitoring using Prometheus, Grafana, ELK, or Datadog with distributed tracing, APM, and real-time alerting aligned to SLIs/SLOs
- Implement security best practices including least-privilege access, encryption at rest/in transit, network segmentation, and automated compliance validation
- Lead incident response and reliability initiatives, participate in on-call rotation, conduct post-mortems, and drive continuous improvement for system reliability
- Architect disaster recovery and business continuity strategies with automated backup, failover, and recovery processes
- Develop reusable infrastructure modules and templates to accelerate environment provisioning and standardize deployment patterns across teams
- Mentor mid-level and senior engineers on cloud architecture, DevOps best practices, and platform reliability through design reviews and technical guidance
- Drive technical documentation and knowledge sharing including runbooks, architecture decision records (ADRs), and infrastructure standards
Requirements
- 7+ years of hands-on experience in DevOps, Site Reliability Engineering, or Infrastructure Engineering, with a proven track record of architecting production systems
- Expert-level proficiency with Docker, Kubernetes (CKA/CKAD preferred), and cloud-native technologies in production environments
- 5+ years implementing Infrastructure as Code with Terraform, Ansible, or Pulumi, managing large-scale environments (50+ cloud resources)
- Deep experience with cloud platforms (AWS, Azure, or GCP) including compute, networking, storage, and managed services
- Proven experience building and scaling CI/CD pipelines with integrated security controls (GitHub Actions, GitLab CI, Jenkins, ArgoCD)
- Strong programming skills in Python (preferred for automation), Bash, or Go for infrastructure tooling and automation
- Production experience with observability and monitoring tools: Prometheus, Grafana, ELK, CloudWatch, Datadog, or similar
- Experience with MLOps workflows: model deployment automation, versioning, and lifecycle management
- Demonstrated experience with GitOps methodologies and declarative infrastructure management
- Strong understanding of security best practices: encryption, secrets management, identity and access management (IAM), network security
- Excellent written and verbal communication skills for technical documentation and cross-functional collaboration
Benefits
- Competitive salary and performance-based incentives
- Comprehensive health, dental, and vision benefits package
- 401k Match (US-based only)
- $200/month Health and Wellness Stipend
- $400/year Continuing Education Credit
- $500/year Function Health subscription (US-based only)
- Free parking for in-office employees
- Unlimited Approved PTO
- Parental Leave for Eligible Employees
- Supplemental Life Insurance
Applicant Tracking System Keywords
Hard Skills & Tools
Infrastructure as Code, Terraform, Ansible, Pulumi, Kubernetes, Docker, CI/CD, Python, Bash, Go
Soft Skills
leadership, mentoring, communication, collaboration, incident response, continuous improvement, technical documentation
Certifications
CKA, CKAD