Tech Stack
AWS, Azure, Cloud, Docker, Google Cloud Platform, Grafana, Java, Kubernetes, Prometheus, Python, Terraform, TypeScript
About the role
- Design, build, and maintain MLOps infrastructure, establishing ML CI/CD best practices including model testing, versioning, and deployment.
- Develop and manage scalable automated pipelines for training, evaluating, and deploying ML models, with emphasis on LLM systems.
- Implement robust monitoring and logging for production models to track performance, drift, and data quality.
- Collaborate with Data Scientists to containerize and productionize models and algorithms, including RAG and Graph RAG approaches.
- Manage and optimize cloud infrastructure for ML workloads (e.g., Amazon Bedrock), focusing on performance, cost-effectiveness, and scalability.
- Automate provisioning of ML infrastructure using Infrastructure as Code.
- Integrate ML models into production product architecture, working closely with product and engineering teams.
- Own the operational aspects of the AI lifecycle: deployment, A/B testing, incident response, and continuous improvement.
- Contribute to the AI strategy and roadmap, advising on the operational feasibility and scalability of AI features.
- Collaborate with Principal Data Scientists and Principal Engineers to ensure the MLOps framework supports full AI workflows and model interaction layers.
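To give a flavor of the monitoring work described above, here is a deliberately minimal drift-check sketch in Python. It is illustrative only, not part of our stack: production systems would use richer statistics (e.g. PSI or KS tests), and every name and threshold here is hypothetical.

```python
import statistics


def detect_drift(baseline, live, threshold=3.0):
    """Flag drift when the live feature mean departs from the baseline
    mean by more than `threshold` baseline standard deviations.

    A toy stand-in for production drift metrics; all names and the
    threshold are illustrative.
    """
    mu = statistics.fmean(baseline)
    sigma = statistics.stdev(baseline)
    if sigma == 0:
        # A constant baseline gives no scale to measure drift against.
        return False
    z = abs(statistics.fmean(live) - mu) / sigma
    return z > threshold
```

A shifted live sample (e.g. values around 5.0 against a baseline around 1.0) trips the check, while a live sample drawn from the same range does not.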
Requirements
- A practical, automation-driven engineering mindset focused on reliability, scalability, and efficiency.
- Hands-on experience building and managing CI/CD pipelines for machine learning.
- Comfortable writing production-quality code and reviewing PRs.
- Proven track record implementing MLOps best practices in production.
- Curious about the operational challenges of LLMs and about building robust systems to support them.
- Experience with model lifecycle management and experiment tracking.
- Ability to design and implement infrastructure for complex AI systems, including vector stores and graph databases.
- Proven ability to ensure performance and reliability of systems over time.
- 3+ years of experience in an MLOps, DevOps, or Software Engineering role focused on ML infrastructure.
- Proficiency in Python and experience building/maintaining infrastructure and automation.
- Experience with Java or TypeScript is beneficial.
- Deep experience with at least one major cloud provider (AWS, GCP, Azure) and their ML services (e.g., SageMaker, Vertex AI); experience with Amazon Bedrock is a significant plus.
- Strong familiarity with containerization (Docker) and orchestration (Kubernetes).
- Experience with Infrastructure as Code (e.g., Terraform, CloudFormation).
- Experience deploying and managing LLM-powered features in production.
- Bonus: experience with monitoring tools (Prometheus, Grafana), agent orchestration, or legaltech domain knowledge.
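As one concrete touchpoint for the monitoring tools mentioned in the bonus line, here is a hand-rolled sketch that renders model-serving latency stats in Prometheus text exposition format. It is illustrative only: real services would use the official prometheus_client library, and the metric name is hypothetical.

```python
def render_metrics(latencies_ms):
    """Render inference-latency stats in Prometheus text exposition
    format (summary-style _count and _sum series).

    A hand-rolled sketch; production code would use prometheus_client.
    The metric name `model_latency_ms` is hypothetical.
    """
    lines = [
        "# HELP model_latency_ms Observed inference latency in milliseconds.",
        "# TYPE model_latency_ms summary",
        f"model_latency_ms_count {len(latencies_ms)}",
        f"model_latency_ms_sum {sum(latencies_ms)}",
    ]
    return "\n".join(lines)
```

Scraped by Prometheus, the `_count` and `_sum` series are enough to graph average latency in Grafana via `rate(model_latency_ms_sum[5m]) / rate(model_latency_ms_count[5m])`.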