The Walt Disney Company

Senior MLOps Engineer

full-time

Origin: 🇺🇸 United States • California

Salary

💰 $152,100 - $203,900 per year

Job Level

Senior

Tech Stack

AWS, Azure, Cloud, Docker, Go, Google Cloud Platform, Grafana, Kubernetes, Prometheus, Python, TensorFlow

About the role

  • Develop, deploy, and maintain scalable infrastructure for machine learning model training, retraining, and inference
  • Design and optimize CI/CD pipelines specifically tailored for machine learning workflows
  • Implement robust monitoring and logging systems to track model performance and identify potential issues
  • Collaborate with AI researchers and data scientists to ensure infrastructure aligns with project requirements and supports iterative experimentation
  • Manage compute resources (cloud and on-premises) to enable large-scale distributed training and inference tasks
  • Containerize machine learning models and applications using Docker and deploy via Kubernetes or equivalent orchestration systems
  • Automate deployment workflows for serving ML models using frameworks such as TorchServe, TensorFlow Serving, and FastAPI
  • Implement model versioning, rollback strategies, and governance for maintaining production stability
  • Optimize cost efficiency and performance of machine learning workflows in cloud environments such as AWS, GCP, or Azure
  • Stay current with emerging MLOps tools and practices, integrating them into existing workflows to improve performance and reliability

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a related field
  • Master’s degree preferred
  • 5+ years of experience in DevOps, Site Reliability Engineering, or a related role, with at least 2 years focused on MLOps
  • Expertise in building and maintaining CI/CD pipelines for machine learning applications
  • Strong proficiency with containerization (Docker) and orchestration tools (Kubernetes)
  • Proficiency in deploying machine learning models using frameworks such as TensorFlow Serving, TorchServe, or custom APIs
  • Deep understanding of cloud infrastructure and services (AWS, GCP, or Azure) for ML workloads, including GPU and TPU utilization
  • Experience managing large-scale distributed training workflows and optimizing resource allocation
  • Familiarity with tools like MLflow, DVC, Weights & Biases, or similar for data and model tracking and versioning
  • Solid understanding of security best practices for machine learning systems and sensitive data handling
  • Strong scripting and programming skills in Python, Bash, or Go
  • Experience with data orchestration tools such as DataChain or Weights & Biases preferred
  • Hands-on experience with automated hyperparameter tuning and optimization frameworks preferred
  • Familiarity with model monitoring tools like Prometheus, Grafana, or custom solutions preferred
  • Experience integrating pre-trained foundation models and managing their deployment at scale preferred
  • Contributions to open-source MLOps projects or relevant research publications preferred