Build and maintain monitoring infrastructure for conventional machine learning models, with capabilities for performance tracking, drift detection, and alerting.
Research, evaluate, and implement monitoring strategies and tools for Generative AI systems, including LLMs and Agentic AI architectures.
Collaborate with ML Engineers, Data Scientists, and DevOps teams to deploy, manage, and monitor models in production.
Develop and support scalable, secure, and automated data pipelines using Snowflake, SQL, and Python for training, serving, and monitoring ML and GenAI models.
Leverage AutoML tools and frameworks (e.g., MLflow, Kubeflow, SageMaker Autopilot) to streamline experimentation and deployment.
Design dashboards and reporting systems to visualize model health metrics and surface key operational insights.
Ensure auditability, reproducibility, and compliance for model performance and data flow in production environments, with consideration for regulatory standards like GDPR and HIPAA.
Maintain CI/CD workflows and version-controlled codebases (e.g., Git) for ML infrastructure and pipelines.
Utilize containerization and orchestration technologies (e.g., Docker) to manage scalable ML infrastructure.
Leverage tools such as Streamlit and Python visualization libraries to present insights from model and data monitoring.
Perform root cause analyses on model degradation or data quality issues, and proactively implement improvements.
Stay current on industry developments related to ML observability, model governance, responsible GenAI practices, and AI security.
Contribute to analytics projects and data engineering initiatives as needed.
Provide off-hours support for critical deployments or urgent data/model issues.

Requirements

2–5 years of experience in MLOps, ML Engineering, or a related role with a focus on production-level model monitoring, automation, and deployment.
Strong experience with ML observability tools or custom-built monitoring systems.
Experience with monitoring LLMs and Generative AI models, including prompt evaluation, hallucination tracking, and agent behavior auditing.
Experience in deploying and managing ML workloads using containerization and orchestration platforms such as Docker, Kubernetes, Kubeflow, or TensorFlow Extended.
Familiarity with AutoML pipelines and workflow management tools (e.g., MLflow, SageMaker Autopilot).
Experience working in cloud environments, preferably AWS (e.g., SageMaker, S3, Lambda, ECS/EKS).
Understanding of ML lifecycle tools (e.g., MLflow, SageMaker Pipelines) and CI/CD practices.
Strong security and compliance awareness, particularly related to model/data governance (e.g., HIPAA, GDPR).
Proficiency in Python and key data libraries (pandas, NumPy, Matplotlib, etc.).
Advanced SQL skills and experience with Snowflake or similar data warehousing platforms.
Proficiency with version control (Git) and agile development methodologies.
Strong collaboration and communication skills, with the ability to explain technical issues to both technical and non-technical stakeholders.
Bachelor’s degree in Computer Science, Engineering, Data Science, or a related field, or equivalent industry experience.
Domain experience in healthcare data (claims, payments) is preferred.

Benefits

401(k) plan with employer match
Flexible paid time off
Holidays
Parental leave
Life and disability insurance
Health benefits, including medical, dental, vision, and prescription drug coverage

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

machine learning, monitoring infrastructure, data pipelines, Python, SQL, AutoML, ML observability, containerization, orchestration, data governance

Soft Skills

collaboration, communication, problem-solving, root cause analysis, technical explanation