Toku

AI Platform Engineer – ML Ops

Full-time

Location Type: Remote

Location: Remote • 🇮🇳 India

Job Level

Mid-Level, Senior

Tech Stack

AWS, Cloud, Docker, EC2, Python

About the role

  • Design, improve, and operate MLOps pipelines for training, deploying, and managing ML models in production.
  • Build and maintain CI/CD-style workflows for model packaging, versioning, and deployment across environments.
  • Operate and optimise AWS-based infrastructure for AI workloads, including compute, storage, and networking components.
  • Manage GPU-enabled workloads, addressing scalability, reliability, and cost-efficiency for high-load AI applications.
  • Implement monitoring and alerting for deployed models, focusing on system health, performance, and operational stability.
  • Own and evolve shared tooling such as MLflow, Docker-based workflows, and deployment frameworks to improve developer productivity (see the MLflow sketch after this list).
  • Work closely with infrastructure, SRE, and engineering teams to align AI platform practices with broader system standards.
  • Support live AI services by diagnosing deployment, scaling, and infrastructure-related issues impacting AI features.
  • Ensure reproducibility, traceability, and governance across the full ML lifecycle, from experimentation to production.
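
By way of illustration only, here is a minimal sketch of the packaging-and-versioning flow these responsibilities describe, using MLflow's tracking API and model registry. The tracking URI, experiment name, model name, and scikit-learn model are hypothetical placeholders, not a description of Toku's actual pipelines.

```python
import mlflow
from mlflow.tracking import MlflowClient
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical tracking server and names; a real pipeline would take these
# from configuration rather than hard-coding them.
mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment("example-classifier")

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

with mlflow.start_run():
    model = LogisticRegression().fit(X, y)
    mlflow.log_param("solver", model.solver)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # Package the model artifact and register a new version in the registry.
    mlflow.sklearn.log_model(
        model, artifact_path="model", registered_model_name="example-classifier"
    )

# Promote the newest registered version so a downstream deployment job can pick
# it up. Newer MLflow releases prefer registered-model aliases over stages.
client = MlflowClient()
version = client.get_latest_versions("example-classifier", stages=["None"])[0]
client.transition_model_version_stage(
    name="example-classifier", version=version.version, stage="Staging"
)
```

This run-then-register pattern is typically what a CI/CD job automates, with the registry version number acting as the traceability link from a deployed artifact back to its training run.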

Requirements

  • Hands-on experience building and operating MLOps pipelines for production ML systems.
  • Strong experience with AWS services used for AI workloads, including EC2, ECS, and SageMaker.
  • Practical experience with Docker and container-based deployment of ML workloads (a container build-and-run sketch follows this list).
  • Experience with MLflow or similar tools for experiment tracking, model versioning, and lifecycle management.
  • Experience managing GPU-based workloads and addressing performance and cost challenges at scale.
  • Strong understanding of cloud infrastructure concepts as they apply to ML systems.
  • Ability to work with Python-based ML codebases to support deployment and lifecycle needs.
  • Working familiarity with LLMs, NLP models, and applied ML concepts sufficient to support deployment and monitoring (without owning core model development).
  • Proven experience supporting live, production ML systems with real customer impact.
  • Ability to work cross-functionally with applied AI engineers, backend engineers, and infra teams.
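
As a rough illustration of the container-based deployment skills listed above, the following sketch builds and runs a model-serving image with the Docker SDK for Python. The image tag, port, and MODEL_URI environment variable are placeholders, assumed for the example rather than taken from this role.

```python
import docker

# Connect to the local Docker daemon (assumes the daemon is running and the
# "docker" Python SDK is installed: pip install docker).
client = docker.from_env()

# Build a serving image from a Dockerfile in the current directory. The tag is
# a placeholder; a real pipeline would version it, e.g. with the git SHA.
image, build_logs = client.images.build(path=".", tag="ml-serving:dev")
for chunk in build_logs:
    if "stream" in chunk:
        print(chunk["stream"], end="")

# Run the container, exposing the model server's port and pointing it at a
# hypothetical MLflow model URI via an environment variable.
container = client.containers.run(
    "ml-serving:dev",
    detach=True,
    ports={"8080/tcp": 8080},
    environment={"MODEL_URI": "models:/example-classifier/Staging"},
)
print("serving container:", container.short_id)
```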

Benefits

  • Training and Development
  • Discretionary Yearly Bonus & Salary Review
  • Healthcare Coverage based on location
  • 20 days Paid Annual Leave (excluding Bank holidays)

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
MLOps, CI/CD, AWS, Docker, MLflow, GPU management, Python, NLP, LLMs, cloud infrastructure
Soft skills
cross-functional collaboration, problem-solving, communication, adaptability, teamwork, organizational skills, analytical thinking, attention to detail, leadership, customer impact focus