Tech Stack
AWS, Azure, Cloud, Docker, Google Cloud Platform, Grafana, Java, Kubernetes, Prometheus, Python, Terraform, TypeScript
About the role
- Design, build, and maintain MLOps infrastructure, establishing ML CI/CD best practices including model testing, versioning, and deployment.
- Develop and manage scalable automated pipelines for training, evaluating, and deploying ML models, with emphasis on LLM systems.
- Implement robust monitoring and logging for production models to track performance, drift, and data quality.
- Collaborate with Data Scientists to containerize and productionize models and algorithms, including RAG and Graph RAG approaches.
- Manage and optimize cloud infrastructure for ML workloads (e.g., Amazon Bedrock), focusing on performance, cost-effectiveness, and scalability.
- Automate provisioning of ML infrastructure using Infrastructure as Code.
- Integrate ML models into production product architecture, working closely with product and engineering teams.
- Own the operational aspects of the AI lifecycle: deployment, A/B testing, incident response, and continuous improvement.
- Contribute to the AI strategy and roadmap, advising on the operational feasibility and scalability of AI features.
- Collaborate with Principal Data Scientists and Principal Engineers to ensure the MLOps framework supports full AI workflows and model interaction layers.
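To give a flavor of the monitoring work described above, here is a deliberately minimal drift-check sketch in Python. It is illustrative only, not part of our stack: production systems would use richer statistics (e.g. PSI or KS tests), and every name and threshold here is hypothetical.

```python
import statistics


def detect_drift(baseline, live, threshold=3.0):
    """Flag drift when the live feature mean departs from the baseline
    mean by more than `threshold` baseline standard deviations.

    A toy stand-in for production drift metrics; all names and the
    threshold are illustrative.
    """
    mu = statistics.fmean(baseline)
    sigma = statistics.stdev(baseline)
    if sigma == 0:
        # A constant baseline gives no scale to measure drift against.
        return False
    z = abs(statistics.fmean(live) - mu) / sigma
    return z > threshold
```

A shifted live sample (e.g. values around 5.0 against a baseline around 1.0) trips the check, while a live sample drawn from the same range does not.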
Requirements
- A practical, automation-driven engineering mindset focused on reliability, scalability, and efficiency.
- Hands-on experience building and managing CI/CD pipelines for machine learning.
- Comfortable writing production-quality code and reviewing PRs.
- Proven track record implementing MLOps best practices in production.
- Curious about the operational challenges of LLMs and about building robust systems to support them.
- Experience with model lifecycle management and experiment tracking.
- Ability to design and implement infrastructure for complex AI systems, including vector stores and graph databases.
- Proven ability to ensure performance and reliability of systems over time.
- 3+ years of experience in an MLOps, DevOps, or Software Engineering role focused on ML infrastructure.
- Proficiency in Python and experience building/maintaining infrastructure and automation.
- Experience with Java or TypeScript is beneficial.
- Deep experience with at least one major cloud provider (AWS, GCP, Azure) and their ML services (e.g., SageMaker, Vertex AI); experience with Amazon Bedrock is a significant plus.
- Strong familiarity with containerization (Docker) and orchestration (Kubernetes).
- Experience with Infrastructure as Code (e.g., Terraform, CloudFormation).
- Experience deploying and managing LLM-powered features in production.
- Bonus: experience with monitoring tools (Prometheus, Grafana), agent orchestration, or legaltech domain knowledge.
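As one concrete touchpoint for the monitoring tools mentioned in the bonus line, here is a hand-rolled sketch that renders model-serving latency stats in Prometheus text exposition format. It is illustrative only: real services would use the official prometheus_client library, and the metric name is hypothetical.

```python
def render_metrics(latencies_ms):
    """Render inference-latency stats in Prometheus text exposition
    format (summary-style _count and _sum series).

    A hand-rolled sketch; production code would use prometheus_client.
    The metric name `model_latency_ms` is hypothetical.
    """
    lines = [
        "# HELP model_latency_ms Observed inference latency in milliseconds.",
        "# TYPE model_latency_ms summary",
        f"model_latency_ms_count {len(latencies_ms)}",
        f"model_latency_ms_sum {sum(latencies_ms)}",
    ]
    return "\n".join(lines)
```

Scraped by Prometheus, the `_count` and `_sum` series are enough to graph average latency in Grafana via `rate(model_latency_ms_sum[5m]) / rate(model_latency_ms_count[5m])`.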