Salary
💰 $152,100 - $203,900 per year
Tech Stack
AWS · Azure · Cloud · Docker · Go · Google Cloud Platform · Grafana · Kubernetes · Prometheus · Python · TensorFlow
About the role
- Develop, deploy, and maintain scalable infrastructure for machine learning model training, retraining, and inference
- Design and optimize CI/CD pipelines specifically tailored for machine learning workflows
- Implement robust monitoring and logging systems to track model performance and identify potential issues
- Collaborate with AI researchers and data scientists to ensure infrastructure aligns with project requirements and supports iterative experimentation
- Manage compute resources (cloud and on-premises) to enable large-scale distributed training and inference tasks
- Containerize machine learning models and applications using Docker and deploy via Kubernetes or equivalent orchestration systems
- Automate deployment workflows for serving ML models using frameworks such as TorchServe, TensorFlow Serving, and FastAPI
- Implement model versioning, rollback strategies, and governance for maintaining production stability
- Optimize cost efficiency and performance of machine learning workflows in cloud environments such as AWS, GCP, or Azure
- Stay updated with emerging ML Ops tools and practices, integrating them into existing workflows to improve performance and reliability
Requirements
- Bachelor’s degree in Computer Science, Engineering, or a related field
- Master’s degree preferred
- 5+ years of experience in DevOps, Site Reliability Engineering, or a related role, with at least 2+ years focusing on ML Ops
- Expertise in building and maintaining CI/CD pipelines for machine learning applications
- Strong proficiency with containerization (Docker) and orchestration tools (Kubernetes)
- Proficiency in deploying machine learning models using frameworks such as TensorFlow Serving, TorchServe, or custom APIs
- Deep understanding of cloud infrastructure and services (AWS, GCP, or Azure) for ML workloads, including GPU and TPU utilization
- Experience managing large-scale distributed training workflows and optimizing resource allocation
- Familiarity with tools like MLflow, DVC, Weights & Biases, or similar for data and model tracking and versioning
- Solid understanding of security best practices for machine learning systems and sensitive data handling
- Strong scripting and programming skills in Python, Bash, or Go
- Experience with data orchestration tools such as DataChain or Weights & Biases preferred
- Hands-on experience with automated hyperparameter tuning and optimization frameworks preferred
- Familiarity with model monitoring tools like Prometheus, Grafana, or custom solutions preferred
- Experience integrating pre-trained foundation models and managing their deployment at scale preferred
- Contributions to open-source ML Ops projects or relevant research publications preferred