MLOps Support Team Lead

CloudFactory

The MLOps Support Team Lead at CloudFactory oversees the daily reliability and support of ML systems, leading a global team responsible for operational maturity and incident management.

Posted 5/18/2026 | Full-time | Nairobi, 🇰🇪 Kenya | Senior

Tech Stack

Tools & technologies

AWS, Azure, Cloud, Docker, Google Cloud Platform, Grafana, Kubernetes, Python, SQL

About the role

Key responsibilities & impact

Own the operational performance of all production ML systems and pipelines
Ensure reliability, availability, and supportability across client and internal MLOps workloads
Establish and enforce SLAs, SLOs, and operational standards
Act as the escalation point for major incidents and service degradation
Lead a global MLOps Support team (L1/L2) across regions (Colombia, Kenya, Nepal)
Define shift patterns, on-call rotations, and coverage models
Set clear expectations, performance metrics, and development plans
Foster a strong operational culture focused on accountability and continuous improvement
Own incident response processes, including triage, communication, and resolution
Ensure high-quality Root Cause Analysis (RCA) and follow-through on corrective actions
Drive reduction in repeat incidents through structured problem management
Improve time to detect (TTD) and time to resolve (TTR) metrics
Drive implementation and evolution of monitoring across:
- pipelines and data flows
- infrastructure and compute
- model performance and drift
Ensure visibility extends beyond system health to model accuracy, bias, and data integrity
Partner with Engineering to improve instrumentation, logging, and alerting
Define and evolve the MLOps support operating model
Clearly establish boundaries between Support, Engineering, and external partners
Build and maintain runbooks, playbooks, and escalation paths
Standardize intake, triage, and resolution workflows (e.g. Slack, ticketing systems)
Act as the primary operational interface for:
- Engineering teams
- Platform Operations
- External partners
Reduce reliance on individuals by formalizing ownership and knowledge sharing
Provide clear communication during incidents and service updates
Identify trends in incidents and operational inefficiencies
Drive improvements in:
- automation
- alert quality
- self-healing capabilities
Support onboarding of new MLOps projects into a standardized support model
Contribute to building MLOps as a scalable, repeatable service offering
Define and track key operational metrics:
- incident volume and severity
- SLA adherence
- system uptime and reliability
Support regular service reviews and model health reporting
Provide leadership visibility into risks, trends, and improvement areas

Requirements

What you’ll need

Proven experience in operations leadership, SRE, DevOps, or platform support environments
Strong understanding of production support models, incident management, and escalation frameworks
Experience leading or mentoring technical support or operations teams
Working knowledge of ML systems in production, including:
- pipelines and batch processing
- model lifecycle and deployment
- common failure modes
Strong analytical and troubleshooting skills in complex environments
Experience with monitoring and observability tools
Proficiency in:
- SQL
- Python or scripting (Bash)
Ability to operate in a high-pressure, incident-driven environment while maintaining structure and clarity
Strong stakeholder management and communication skills
Nice To Have skills (Preferred)
Experience supporting AI/ML platforms at scale
Familiarity with tools such as:
- Databricks
- MLflow
- Grafana
- Power BI
- New Relic
Exposure to model monitoring (drift, bias, performance validation)
Experience working with external partners or vendors in delivery models
Understanding of cloud platforms (AWS, GCP, Azure)
Experience with containerized environments (Docker / Kubernetes)
Background in building or scaling support functions from early-stage to maturity
Strong service ownership mindset: takes accountability for outcomes, not just activity
Calm, structured, and decisive during incidents
Ability to balance operational delivery with strategic improvement
Passion for building reliable, trustworthy AI/ML systems
Highly collaborative across Engineering, Platform, and Delivery teams
Focus on reducing risk related to:
- model performance
- bias
- data integrity
Commitment to documentation, knowledge sharing, and eliminating single points of failure

Benefits

Comp & perks

At CloudFactory, we believe that work should be more than just a job; it should be a platform for growth, impact, and community. Here, you’ll earn with purpose, learn every day, and serve a mission that truly matters. If you're looking for a career where you can develop professionally, contribute meaningfully, and be part of a global movement, we’d love to have you on this journey!
Join us today and be part of our mission to connect people and technology for a better world! Apply now and bring your whole, authentic self to work. We can’t wait to meet you!

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

SQL, Python, Bash, ML systems, pipelines, batch processing, model lifecycle, incident management, monitoring tools, observability tools

Soft Skills

operations leadership, analytical skills, troubleshooting skills, stakeholder management, communication skills, service ownership mindset, calm under pressure, structured decision-making, collaboration, accountability