Salary
💰 $152,100 - $203,900 per year
Tech Stack
AWS · Azure · Cloud · Docker · Go · Google Cloud Platform · Grafana · Kubernetes · Prometheus · Python · TensorFlow
About the role
- Develop, deploy, and maintain scalable infrastructure for machine learning model training, retraining, and inference
- Design and optimize CI/CD pipelines specifically tailored for machine learning workflows
- Implement robust monitoring and logging systems to track model performance and identify potential issues
- Collaborate with AI researchers and data scientists to ensure infrastructure aligns with project requirements and supports iterative experimentation
- Manage compute resources (cloud and on-premises) to enable large-scale distributed training and inference tasks
- Containerize machine learning models and applications using Docker and deploy via Kubernetes or equivalent orchestration systems
- Automate deployment workflows for serving ML models using frameworks such as TorchServe, TensorFlow Serving, and FastAPI
- Implement model versioning, rollback strategies, and governance for maintaining production stability
- Optimize cost efficiency and performance of machine learning workflows in cloud environments such as AWS, GCP, or Azure
- Stay updated with emerging ML Ops tools and practices, integrating them into existing workflows to improve performance and reliability
Requirements
- Bachelor’s degree in Computer Science, Engineering, or a related field
- Master’s degree preferred
- 5+ years of experience in DevOps, Site Reliability Engineering, or a related role, with at least 2+ years focusing on ML Ops
- Expertise in building and maintaining CI/CD pipelines for machine learning applications
- Strong proficiency with containerization (Docker) and orchestration tools (Kubernetes)
- Proficiency in deploying machine learning models using frameworks such as TensorFlow Serving, TorchServe, or custom APIs
- Deep understanding of cloud infrastructure and services (AWS, GCP, or Azure) for ML workloads, including GPU and TPU utilization
- Experience managing large-scale distributed training workflows and optimizing resource allocation
- Familiarity with tools like MLflow, DVC, Weights & Biases, or similar for data and model tracking and versioning
- Solid understanding of security best practices for machine learning systems and sensitive data handling
- Strong scripting and programming skills in Python, Bash, or Go
- Experience with data orchestration tools such as DataChain or Weights & Biases preferred
- Hands-on experience with automated hyperparameter tuning and optimization frameworks preferred
- Familiarity with model monitoring tools like Prometheus, Grafana, or custom solutions preferred
- Experience integrating pre-trained foundation models and managing their deployment at scale preferred
- Contributions to open-source ML Ops projects or relevant research publications preferred