Salary
💰 $180,000 - $220,000 per year
Tech Stack
Airflow, Apache, AWS, Cloud, Distributed Systems, Docker, ETL, Google Cloud Platform, Grafana, Kubernetes, Prometheus, Python, Ray, Spark, Spring, SQL, Terraform
About the role
- Infrastructure Development & Optimization: Architect and implement robust, scalable ML infrastructure that supports model training, deployment, and monitoring.
- ML Platform Engineering: Develop and maintain ML model serving and orchestration platforms, ensuring seamless integration with existing engineering workflows, including GitLab pipelines for software and machine learning engineering.
- Data Pipeline and Feature Engineering: Design and optimize ETL/ELT pipelines for ML applications, enabling efficient and reliable data preprocessing and transformation.
- MLOps and Automation: Implement MLOps best practices to streamline model lifecycle management, from training to deployment, monitoring, and retraining.
- Cloud & Containerization: Leverage cloud computing resources (AWS, GCP) and container orchestration (Docker, Kubernetes) to scale ML workloads efficiently.
- Monitoring and Reliability: Develop advanced monitoring systems to track model performance, data drift, and infrastructure health.
- Security & Compliance: Collaborate with privacy and security teams to ensure compliance with regulatory standards and best practices for handling sensitive clinical data.
- Collaboration & Mentorship: Work closely with software engineers, data scientists, and ML engineers to align infrastructure with business and technical goals while mentoring junior engineers.
- Stay Current on Engineering and ML Infrastructure Trends: Keep up to date with advancements in ML platforms, distributed computing, and scalable ML systems, integrating innovative solutions into our ML ecosystem.
Requirements
- Background in Production Distributed Systems: You’ve worked with complex distributed systems and understand how to deploy, monitor, and appropriately alert on them in production.
- Extensive ML Infrastructure Experience: 4+ years of experience in machine learning infrastructure, data engineering, or distributed systems, with a strong focus on building scalable, high-performance ML platforms.
- Strong ML Workflow Expertise: Deep understanding of ML pipeline orchestration, model deployment, and monitoring in production environments.
- Cloud and MLOps Proficiency: Hands-on experience with cloud ML platforms (AWS SageMaker, GCP Vertex AI) and orchestration tools (Kubeflow, Airflow, or Dagster).
- Programming & Automation Skills: Proficiency in Python, SQL, and infrastructure-as-code (Terraform, CloudFormation) to automate ML workflows.
- Scalable Data Processing: Experience with distributed data processing frameworks such as Apache Spark, Ray, or Dask for handling large-scale ML datasets.
- Containerization & DevOps: Strong background in Docker, Kubernetes, CI/CD, and monitoring tools (Prometheus, Grafana) for infrastructure management.
- Security & Compliance Awareness: Knowledge of best practices for data governance, security, and regulatory compliance, particularly in healthcare or life sciences.
- Strong Problem-Solving & Collaboration Skills: Ability to troubleshoot complex ML infrastructure issues and work cross-functionally with engineers, data scientists, and product teams.