
AI Platform Engineer – ML Ops
Toku
full-time
Location Type: Remote
Location: Remote • 🇮🇳 India
Job Level
Mid-Level, Senior
Tech Stack
AWS, Cloud, Docker, EC2, Python
About the role
- Design, improve, and operate MLOps pipelines for training, deploying, and managing ML models in production.
- Build and maintain CI/CD-style workflows for model packaging, versioning, and deployment across environments.
- Operate and optimise AWS-based infrastructure for AI workloads, including compute, storage, and networking components.
- Manage GPU-enabled workloads, addressing scalability, reliability, and cost-efficiency for high-load AI applications.
- Implement monitoring and alerting for deployed models, focusing on system health, performance, and operational stability.
- Own and evolve shared tooling such as MLflow, Docker-based workflows, and deployment frameworks to improve developer productivity.
- Work closely with infrastructure, SRE, and engineering teams to align AI platform practices with broader system standards.
- Support live AI services by diagnosing deployment, scaling, and infrastructure-related issues impacting AI features.
- Ensure reproducibility, traceability, and governance across the full ML lifecycle, from experimentation to production.
Requirements
- Hands-on experience building and operating MLOps pipelines for production ML systems.
- Strong experience with AWS services used for AI workloads, including EC2, ECS, and SageMaker.
- Practical experience with Docker and container-based deployment of ML workloads.
- Experience with MLflow or similar tools for experiment tracking, model versioning, and lifecycle management.
- Experience managing GPU-based workloads and addressing performance and cost challenges at scale.
- Strong understanding of cloud infrastructure concepts as they apply to ML systems.
- Ability to work with Python-based ML codebases to support deployment and lifecycle needs.
- Working familiarity with LLMs, NLP models, and applied ML concepts sufficient to support deployment and monitoring (without owning core model development).
- Proven experience supporting live, production ML systems with real customer impact.
- Ability to work cross-functionally with applied AI engineers, backend engineers, and infra teams.
Benefits
- Training and Development
- Discretionary Yearly Bonus & Salary Review
- Healthcare Coverage based on location
- 20 days Paid Annual Leave (excluding Bank holidays)
Applicant Tracking System Keywords
Hard skills
MLOps, CI/CD, AWS, Docker, MLflow, GPU management, Python, NLP, LLMs, cloud infrastructure
Soft skills
cross-functional collaboration, problem-solving, communication, adaptability, teamwork, organizational skills, analytical thinking, attention to detail, leadership, customer impact focus