CloudFactory

MLOps Support Team Lead


The MLOps Support Team Lead at CloudFactory oversees the day-to-day reliability and support of production ML systems and leads a global team responsible for operational maturity and incident management.

Posted 5/18/2026 • Full-time • Nairobi, 🇰🇪 Kenya • Senior

Tech Stack

Tools & technologies
AWS, Azure, Cloud, Docker, Google Cloud Platform, Grafana, Kubernetes, Python, SQL

About the role

Key responsibilities & impact
  • Own the operational performance of all production ML systems and pipelines
  • Ensure reliability, availability, and supportability across client and internal MLOps workloads
  • Establish and enforce SLAs, SLOs, and operational standards
  • Act as the escalation point for major incidents and service degradation
  • Lead a global MLOps Support team (L1/L2) across regions (Colombia, Kenya, Nepal)
  • Define shift patterns, on-call rotations, and coverage models
  • Set clear expectations, performance metrics, and development plans
  • Foster a strong operational culture focused on accountability and continuous improvement
  • Own incident response processes, including triage, communication, and resolution
  • Ensure high-quality Root Cause Analysis (RCA) and follow-through on corrective actions
  • Drive reduction in repeat incidents through structured problem management
  • Improve time to detect (TTD) and time to resolve (TTR) metrics
  • Drive implementation and evolution of monitoring across:
      - pipelines and data flows
      - infrastructure and compute
      - model performance and drift
  • Ensure visibility extends beyond system health to model accuracy, bias, and data integrity
  • Partner with Engineering to improve instrumentation, logging, and alerting
  • Define and evolve the MLOps support operating model
  • Clearly establish boundaries between Support, Engineering, and external partners
  • Build and maintain runbooks, playbooks, and escalation paths
  • Standardize intake, triage, and resolution workflows (e.g. Slack, ticketing systems)
  • Act as the primary operational interface for:
      - Engineering teams
      - Platform Operations
      - External partners
  • Reduce reliance on individuals by formalizing ownership and knowledge sharing
  • Provide clear communication during incidents and service updates
  • Identify trends in incidents and operational inefficiencies
  • Drive improvements in:
      - automation
      - alert quality
      - self-healing capabilities
  • Support onboarding of new MLOps projects into a standardized support model
  • Contribute to building MLOps as a scalable, repeatable service offering
  • Define and track key operational metrics:
      - incident volume and severity
      - SLA adherence
      - system uptime and reliability
  • Support regular service reviews and model health reporting
  • Provide leadership visibility into risks, trends, and improvement areas

Requirements

What you’ll need
  • Proven experience in operations leadership, SRE, DevOps, or platform support environments
  • Strong understanding of production support models, incident management, and escalation frameworks
  • Experience leading or mentoring technical support or operations teams
  • Working knowledge of ML systems in production, including:
      - pipelines and batch processing
      - model lifecycle and deployment
      - common failure modes
  • Strong analytical and troubleshooting skills in complex environments
  • Experience with monitoring and observability tools
  • Proficiency in:
      - SQL
      - Python or scripting (Bash)
  • Ability to operate in a high-pressure, incident-driven environment while maintaining structure and clarity
  • Strong stakeholder management and communication skills
Nice to have (preferred)
  • Experience supporting AI/ML platforms at scale
  • Familiarity with tools such as:
      - Databricks
      - MLflow
      - Grafana
      - Power BI
      - New Relic
  • Exposure to model monitoring (drift, bias, performance validation)
  • Experience working with external partners or vendors in delivery models
  • Understanding of cloud platforms (AWS, GCP, Azure)
  • Experience with containerized environments (Docker / Kubernetes)
  • Background in building or scaling support functions from early-stage to maturity
  • Strong service ownership mindset — takes accountability for outcomes, not just activity
  • Calm, structured, and decisive during incidents
  • Ability to balance operational delivery with strategic improvement
  • Passion for building reliable, trustworthy AI/ML systems
  • Highly collaborative across Engineering, Platform, and Delivery teams
  • Focus on reducing risk related to:
      - model performance
      - bias
      - data integrity
  • Commitment to documentation, knowledge sharing, and eliminating single points of failure

Benefits

Comp & perks
  • At CloudFactory, we believe that work should be more than just a job; it should be a platform for growth, impact, and community. Here, you’ll earn with purpose, learn every day, and serve a mission that truly matters. If you're looking for a career where you can develop professionally, contribute meaningfully, and be part of a global movement, we’d love to have you on this journey!
  • Join us today and be part of our mission to connect people and technology for a better world! Apply now and bring your whole, authentic self to work. We can’t wait to meet you!

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
SQL, Python, Bash, ML systems, pipelines, batch processing, model lifecycle, incident management, monitoring tools, observability tools
Soft Skills
operations leadership, analytical skills, troubleshooting skills, stakeholder management, communication skills, service ownership mindset, calm under pressure, structured decision-making, collaboration, accountability