Opus 2

Machine Learning Operations Engineer, AI

Full-time

Location: 🇬🇧 United Kingdom


Job Level

Mid-Level / Senior

Tech Stack

AWS, Azure, Cloud, Docker, Google Cloud Platform, Grafana, Java, Kubernetes, Prometheus, Python, Terraform, TypeScript

About the role

  • Design, build, and maintain MLOps infrastructure, establishing ML CI/CD best practices including model testing, versioning, and deployment.
  • Develop and manage scalable automated pipelines for training, evaluating, and deploying ML models, with emphasis on LLM systems.
  • Implement robust monitoring and logging for production models to track performance, drift, and data quality.
  • Collaborate with Data Scientists to containerize and productionize models and algorithms, including RAG and Graph RAG approaches.
  • Manage and optimize cloud infrastructure for ML workloads (e.g., Amazon Bedrock), focusing on performance, cost-effectiveness, and scalability.
  • Automate provisioning of ML infrastructure using Infrastructure as Code.
  • Integrate ML models into production product architecture, working closely with product and engineering teams.
  • Own the operational aspects of the AI lifecycle: deployment, A/B testing, incident response, and continuous improvement.
  • Contribute to AI strategy and roadmap, advising on operational feasibility and scalability of AI features.
  • Collaborate with Principal Data Scientists and Principal Engineers to ensure the MLOps framework supports full AI workflows and model interaction layers.

Requirements

  • A practical, automation-driven engineering mindset focused on reliability, scalability, and efficiency.
  • Hands-on experience building and managing CI/CD pipelines for machine learning.
  • Comfortable writing production-quality code and reviewing PRs.
  • Proven track record implementing MLOps best practices in production.
  • Curious about the operational challenges of LLMs and about building robust systems to support them.
  • Experience with model lifecycle management and experiment tracking.
  • Ability to design and implement infrastructure for complex AI systems, including vector stores and graph databases.
  • Proven ability to ensure performance and reliability of systems over time.
  • 3+ years of experience in an MLOps, DevOps, or Software Engineering role focused on ML infrastructure.
  • Proficiency in Python and experience building/maintaining infrastructure and automation.
  • Experience with Java or TypeScript is beneficial.
  • Deep experience with at least one major cloud provider (AWS, GCP, Azure) and their ML services (e.g., SageMaker, Vertex AI); experience with Amazon Bedrock is a significant plus.
  • Strong familiarity with containerization (Docker) and orchestration (Kubernetes).
  • Experience with Infrastructure as Code (e.g., Terraform, CloudFormation).
  • Experience deploying and managing LLM-powered features in production.
  • Bonus: experience with monitoring tools (Prometheus, Grafana), agent orchestration, or legaltech domain knowledge.