NVIDIA

Senior MLOps Engineer

full-time

Location Type: Remote

Location: Remote • California • 🇺🇸 United States

Salary

💰 $184,000 - $356,500 per year

Job Level

Senior

Tech Stack

Airflow, Cloud, Go, Grafana, Kubernetes, Prometheus, Python, PyTorch, Rust, TensorFlow

About the role

  • Identify infrastructure and software bottlenecks to improve ML job startup time, data load/write time, resiliency, and failure recovery
  • Translate research workflows into automated, scalable, and reproducible systems that accelerate experimentation
  • Build CI/CD workflows tailored for ML to support data preparation, model training, validation, deployment, and monitoring
  • Develop observability frameworks to monitor performance, utilization, and health of large-scale training clusters
  • Collaborate with hardware and platform teams to optimize models for emerging GPU architectures, interconnects, and storage technologies
  • Develop guidelines for dataset versioning, experiment tracking, and model governance to ensure reliability and compliance
  • Mentor and guide engineering and research partners on MLOps patterns, scaling NVIDIA’s impact from research to production
  • Collaborate with NVIDIA Research teams and the DGX Cloud Customer Success team to continuously improve MLOps automation

Requirements

  • BS in Computer Science, Information Systems, Computer Engineering, or equivalent experience
  • 8+ years of experience in large-scale software or infrastructure systems, with 5+ years dedicated to ML platforms or MLOps
  • Proven track record designing and operating ML infrastructure for production training workloads
  • Expert knowledge of distributed training frameworks (PyTorch, TensorFlow, JAX) and orchestration systems (Kubernetes, Slurm, Kubeflow, Airflow, MLflow)
  • Strong programming experience in Python plus at least one systems language (Go, C++, Rust)
  • Deep understanding of GPU scheduling, container orchestration, and cloud-native environments
  • Experience integrating observability stacks (Prometheus, Grafana, ELK) with ML workloads
  • Familiarity with storage and data platforms that support large-scale training (object stores, feature stores, versioned datasets)
  • Strong communication skills and the ability to collaborate effectively with research teams to turn requirements into scalable engineering solutions

Benefits

  • Equity
  • Benefits
