Principal MLOps Engineer

Solventum

Lead MLOps engineering as a Principal MLOps Engineer at Solventum, shaping AI integration in healthcare systems. Define operational standards and ensure reliability in clinical environments.

Posted 6/4/2026 • Full-time • Remote • Pennsylvania • 🇺🇸 United States • Lead • $142,800 - $196,350 per year

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills

Python, Java, Go, Cloud Providers, Kubernetes, Kubeflow, Airflow, monitoring stacks, backend frameworks, microservices

Soft Skills

leadership, communication, organizational, mentoring, incident management, problem-solving, collaboration, production discipline, governance, reporting

Tools & Technologies

AWS, GCP, Azure, Prometheus, Grafana, Datadog, CI/CD pipelines, operational runbooks, deployment pipelines, version control

Certifications & Qualifications

Bachelor's Degree, Master's Degree

Industry Keywords

Healthcare Information Systems, AI integration, operational architecture, release processes, model registry, SLAs, SLOs, automated checks, production systems, compliance support

Tech Stack

Tools & technologies

Airflow, AWS, Azure, Cloud, Go, Google Cloud Platform, Grafana, Java, Kubernetes, Microservices, Prometheus, Python

About the role

Key responsibilities & impact

Lead the operational architecture, deployment strategy, and reliability engineering for integrating AI into high-stakes Healthcare Information Systems (HIS)
Define the enterprise operational standards, govern the release processes, and build the resilient infrastructure required to maintain models in mission-critical clinical environments
Architect and govern the comprehensive release process, defining enterprise checklists, automated approval gates, release notes, and deployment readiness standards
Establish the deployment execution standards for promoting AI across all environments and ensure customer deployments adhere to strict internal production discipline
Architect and oversee the enterprise model registry, ensuring seamless integration with CI/CD pipelines and full version control traceability
Define and enforce monitoring standards, establishing critical SLAs/SLOs, service health metrics, and comprehensive dashboards across the AI ecosystem
Architect automated checks for input/output data quality and model drift, ensuring proactive detection of system degradation
Establish and lead the production incident process, including rigorous triage workflows, severity escalation paths, postmortems, rollback mechanisms, and recovery infrastructure
Partner with Platform teams to provide essential ATO (Authority to Operate) and compliance support, ensuring complete deployment traceability and strict operational controls
Oversee comprehensive operational reporting, providing leadership with status updates across production systems, pre-prod testing, customer rollouts, and incident metrics
Foster a culture of production discipline, guiding junior engineers in maintaining operational runbooks and reliable deployment pipelines

Requirements

What you’ll need

Bachelor's Degree or Higher in Computer Science, Software Engineering, or related technical field
10+ years of experience in software engineering, with at least 6 years dedicated to deploying and maintaining large-scale ML systems in production
Expert-level experience with Cloud Providers (AWS/GCP/Azure) and orchestration tools (Kubernetes, Kubeflow, or Airflow)
Expert-level Python and Java/Go (or similar)
Deep proficiency in backend frameworks, microservices, and system design patterns
Expert knowledge of monitoring stacks (Prometheus, Grafana, Datadog) and establishing enterprise SLAs/SLOs for AI services
Proven track record of designing automated deployment pipelines, managing complex rollback procedures, and enforcing model registry governance at scale

Benefits

Comp & perks

Medical
Dental & Vision
Health Savings Accounts
Health Care & Dependent Care Flexible Spending Accounts
Disability Benefits
Life Insurance
Voluntary Benefits
Paid Absences
Retirement Benefits