Lead Machine Learning Operations Engineer

Paramount

The Lead Machine Learning Operations Engineer at Paramount oversees the reliability and governance of ML systems, with a focus on production health, incident response, and operational rigor.

Posted 6/5/2026 • Full-time • Remote • New York • 🇺🇸 United States • Senior • 💰 $157,000 - $235,000 per year

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills

machine learning engineering, MLOps, reliability engineering, model validation, data quality, SQL, model registries, feature stores, ML observability, production monitoring

Soft Skills

cross-functional collaboration, written communication, verbal communication, influencing architecture, mentoring, technical leadership, problem-solving, incident response, stakeholder engagement, operational rigor

Tools & Technologies

ML metadata systems, model deployment pipelines, monitoring tools, diagnostic workflows, dashboards, alerting systems, anomaly detection tools, canary strategies, rollback playbooks, hotfix strategies

Industry Keywords

production ML systems, model health, model traceability, end-to-end ML systems, data drift, feature drift, model behavior changes, SLA performance, post-deployment metrics, business outcome measurement

Tech Stack

Tools & technologies

SQL

About the role

Key responsibilities & impact

Own ML production reliability strategy
Define and lead the operational strategy for production ML systems, including monitoring, traceability, deployment safety, incident response, and post-deployment validation.
Set the standards ML teams use to assess model health, performance, and trustworthiness in production.
Own model traceability and governance
Ensure every production model has clear lineage (data, features, code, artifacts, validation, deployment history) and drive adoption of model registry and metadata tooling across ML teams.
Build end-to-end ML observability
Design and implement monitoring across the full ML signal path: data arrival, feature freshness, distribution stability, candidate generation, ranking behavior, model metrics, serving latency, and SLA performance.
Define production health metrics
Partner with ML, data, product, and business stakeholders to define post-deployment metrics covering model quality, system reliability, business guardrails, and degradation indicators.
Detect drift and degradation proactively
Detect data drift, feature drift, model behavior changes, and silent failures before they impact customers via thresholding, alerting, anomaly detection, and release-over-release monitoring.
Lead diagnostic tooling and root-cause analysis
Build dashboards, logs, and diagnostic workflows that progress quickly from 'recommendations look off' to root cause, with context captured across candidates, features, scores, ranking decisions, and downstream outcomes.
Own ML deployment safety
Define and operate automated gates that prevent bad models or bad data from being promoted to production.
Partner with MLEs to establish validation checks, rollback criteria, canary strategies, shadow testing, and release health reviews.
Lead ML incident response
Own incident response practices for ML systems, including rollback playbooks, hotfix strategies, severity definitions, tradeoff frameworks, communications, and post-mortems.
Drive closure of systemic gaps after incidents rather than only resolving the immediate issue.
Partner across ML Platform, Data, and ML
Partner with DevOps/Platform on infrastructure and observability needs; with Data Engineering on data quality, drift, and freshness; and with ML Engineering to embed operational requirements into development and deployment workflows.
Set standards and mentor others
Act as the technical lead for ML operations: establish reusable patterns, playbooks, and standards, and mentor engineers on reliability, observability, and operational rigor.

Requirements

What you’ll need

5+ years of experience in machine learning engineering, ML platform, applied ML, MLOps, data platform, reliability engineering, or a related technical role.
Demonstrated experience operating production ML systems, including monitoring, deployment, incident response, model validation, data quality, or reliability ownership.
Experience leading technical initiatives across multiple engineering teams, especially where success required influencing architecture, tooling, standards, or adoption.
Hands-on experience with model registries, feature stores, ML metadata systems, production monitoring, model deployment pipelines, or ML observability platforms.
Solid knowledge of end-to-end ML systems, including training data, features, model artifacts, offline validation, online serving, post-deployment metrics, and business outcome measurement.
Ability to reason about ML operational failure modes: stale features, distribution shift, training-serving skew, delayed labels, and offline-online metric gaps.
Solid SQL skills and comfort investigating data quality, feature distributions, model outputs, pipeline behavior, and production anomalies.
Track record of cross-functional collaboration with Platform, Data, and ML Engineering to deliver production-grade operational capabilities.
Solid written and verbal communication skills, including the ability to explain ML system health, risks, incidents, and tradeoffs to both technical and non-technical stakeholders.

Benefits

Comp & perks

medical
dental
vision
401(k) plan
life insurance coverage
disability benefits
tuition assistance program
PTO