Zelis

ML Ops Engineer

full-time

Location Type: Remote

Location: New Jersey, United States

Salary

💰 $127,000 - $160,550 per year

About the role

  • Build and maintain monitoring infrastructure for conventional machine learning models, with capabilities for performance tracking, drift detection, and alerting.
  • Research, evaluate, and implement monitoring strategies and tools for Generative AI systems, including LLMs and Agentic AI architectures.
  • Collaborate with ML Engineers, Data Scientists, and DevOps teams to deploy, manage, and monitor models in production.
  • Develop and support scalable, secure, and automated data pipelines using Snowflake, SQL, and Python for training, serving, and monitoring ML and GenAI models.
  • Use AutoML and ML lifecycle tools and frameworks (e.g., MLflow, Kubeflow, SageMaker Autopilot) to streamline experimentation and deployment.
  • Design dashboards and reporting systems to visualize model health metrics and surface key operational insights.
  • Ensure auditability, reproducibility, and compliance for model performance and data flow in production environments, with consideration for regulatory standards like GDPR and HIPAA.
  • Maintain CI/CD workflows and version-controlled codebases (e.g., Git) for ML infrastructure and pipelines.
  • Utilize containerization and orchestration technologies (e.g., Docker) to manage scalable ML infrastructure.
  • Leverage tools such as Streamlit and Python visualization libraries to present insights from model and data monitoring.
  • Perform root cause analyses on model degradation or data quality issues, and proactively implement improvements.
  • Stay current on industry developments related to ML observability, model governance, responsible GenAI practices, and AI security.
  • Contribute to analytics projects and data engineering initiatives as needed.
  • Provide off-hours support for critical deployments or urgent data/model issues.

Requirements

  • 2–5 years of experience in ML Ops, ML Engineering, or a related role with a focus on production-level model monitoring, automation, and deployment.
  • Strong experience with ML observability tools or custom-built monitoring systems.
  • Experience with monitoring LLMs and Generative AI models, including prompt evaluation, hallucination tracking, and agent behavior auditing.
  • Experience in deploying and managing ML workloads using containerization and orchestration platforms such as Docker, Kubernetes, Kubeflow, or TensorFlow Extended.
  • Familiarity with AutoML pipelines and workflow management tools (e.g., MLflow, SageMaker Autopilot).
  • Experience working in cloud environments, preferably AWS (e.g., SageMaker, S3, Lambda, ECS/EKS).
  • Understanding of ML lifecycle tools (e.g., MLflow, SageMaker Pipelines) and CI/CD practices.
  • Strong security and compliance awareness, particularly related to model/data governance (e.g., HIPAA, GDPR).
  • Proficiency in Python and key data libraries (pandas, NumPy, Matplotlib, etc.).
  • Advanced SQL skills and experience with Snowflake or similar data warehousing platforms.
  • Proficiency with version control (Git) and agile development methodologies.
  • Strong collaboration and communication skills, with the ability to explain technical issues to both technical and non-technical stakeholders.
  • Bachelor’s degree in Computer Science, Engineering, Data Science, or a related field, or equivalent industry experience.
  • Domain experience in healthcare data (claims, payments) is preferred.

Benefits

  • 401(k) plan with employer match
  • Flexible paid time off
  • Holidays
  • Parental leave
  • Life and disability insurance
  • Health benefits, including medical, dental, vision, and prescription drug coverage