Paradigm Health

Staff Software Engineer, DevOps/MLOps

full-time

Origin: 🇺🇸 United States

Salary

💰 $180,000 - $220,000 per year

Job Level

Lead

Tech Stack

Airflow, Apache, AWS, Cloud, Distributed Systems, Docker, ETL, Google Cloud Platform, Grafana, Kubernetes, Prometheus, Python, Ray, Spark, Spring, SQL, Terraform

About the role

  • Infrastructure Development & Optimization: Architect and implement robust, scalable ML infrastructure that supports model training, deployment, and monitoring.
  • ML Platform Engineering: Develop and maintain ML model serving and orchestration platforms, ensuring seamless integration with existing engineering workflows, including GitLab pipelines for software and machine learning engineering.
  • Data Pipeline and Feature Engineering: Design and optimize ETL/ELT pipelines for ML applications, enabling efficient and reliable data preprocessing and transformation.
  • MLOps and Automation: Implement MLOps best practices to streamline model lifecycle management, from training to deployment, monitoring, and retraining.
  • Cloud & Containerization: Leverage cloud computing resources (AWS, GCP) and container orchestration (Docker, Kubernetes) to scale ML workloads efficiently.
  • Monitoring and Reliability: Develop advanced monitoring systems to track model performance, data drift, and infrastructure health.
  • Security & Compliance: Collaborate with privacy and security teams to ensure compliance with regulatory standards and best practices for handling sensitive clinical data.
  • Collaboration & Mentorship: Work closely with software engineers, data scientists, and ML engineers to align infrastructure with business and technical goals while mentoring junior engineers.
  • Stay Current on Engineering and ML Infrastructure Trends: Keep up to date with advancements in ML platforms, distributed computing, and scalable ML systems, integrating innovative solutions into our ML ecosystem.

Requirements

  • Background in Production Distributed Systems: You’ve worked with complex distributed systems and understand how to deploy, monitor, and appropriately alert on them in production.
  • Extensive ML Infrastructure Experience: 4+ years of experience in machine learning infrastructure, data engineering, or distributed systems, with a strong focus on building scalable, high-performance ML platforms.
  • Strong ML Workflow Expertise: Deep understanding of ML pipeline orchestration, model deployment, and monitoring in production environments.
  • Cloud and MLOps Proficiency: Hands-on experience with cloud ML platforms (AWS SageMaker, GCP Vertex AI) and orchestration tools (Kubeflow, Airflow, or Dagster).
  • Programming & Automation Skills: Proficiency in Python, SQL, and infrastructure-as-code (Terraform, CloudFormation) to automate ML workflows.
  • Scalable Data Processing: Experience with distributed data processing frameworks such as Apache Spark, Ray, or Dask for handling large-scale ML datasets.
  • Containerization & DevOps: Strong background in Docker, Kubernetes, CI/CD, and monitoring tools (Prometheus, Grafana) for infrastructure management.
  • Security & Compliance Awareness: Knowledge of best practices for data governance, security, and regulatory compliance, particularly in healthcare or life sciences.
  • Strong Problem-Solving & Collaboration Skills: Ability to troubleshoot complex ML infrastructure issues and work cross-functionally with engineers, data scientists, and product teams.