MLOps Support Team Lead
CloudFactory
MLOps Support Team Lead at CloudFactory, overseeing the daily reliability and support of ML systems and leading a global team to ensure operational maturity and effective incident management.
Tech Stack
Tools & technologies: AWS, Azure, Cloud, Docker, Google Cloud Platform, Grafana, Kubernetes, Python, SQL
About the role
Key responsibilities & impact
- Own the operational performance of all production ML systems and pipelines
- Ensure reliability, availability, and supportability across client and internal MLOps workloads
- Establish and enforce SLAs, SLOs, and operational standards
- Act as the escalation point for major incidents and service degradation
- Lead a global MLOps Support team (L1/L2) across regions (Colombia, Kenya, Nepal)
- Define shift patterns, on-call rotations, and coverage models
- Set clear expectations, performance metrics, and development plans
- Foster a strong operational culture focused on accountability and continuous improvement
- Own incident response processes, including triage, communication, and resolution
- Ensure high-quality Root Cause Analysis (RCA) and follow-through on corrective actions
- Drive reduction in repeat incidents through structured problem management
- Improve time to detect (TTD) and time to resolve (TTR) metrics
- Drive implementation and evolution of monitoring across:
  - pipelines and data flows
  - infrastructure and compute
  - model performance and drift
- Ensure visibility extends beyond system health to model accuracy, bias, and data integrity
- Partner with Engineering to improve instrumentation, logging, and alerting
- Define and evolve the MLOps support operating model
- Clearly establish boundaries between Support, Engineering, and external partners
- Build and maintain runbooks, playbooks, and escalation paths
- Standardize intake, triage, and resolution workflows (e.g. Slack, ticketing systems)
- Act as the primary operational interface for:
  - Engineering teams
  - Platform Operations
  - External partners
- Reduce reliance on individuals by formalizing ownership and knowledge sharing
- Provide clear communication during incidents and service updates
- Identify trends in incidents and operational inefficiencies
- Drive improvements in:
  - automation
  - alert quality
  - self-healing capabilities
- Support onboarding of new MLOps projects into a standardized support model
- Contribute to building MLOps as a scalable, repeatable service offering
- Define and track key operational metrics:
  - incident volume and severity
  - SLA adherence
  - system uptime and reliability
- Support regular service reviews and model health reporting
- Provide leadership visibility into risks, trends, and improvement areas
Requirements
What you’ll need
- Proven experience in operations leadership, SRE, DevOps, or platform support environments
- Strong understanding of production support models, incident management, and escalation frameworks
- Experience leading or mentoring technical support or operations teams
- Working knowledge of ML systems in production, including:
  - pipelines and batch processing
  - model lifecycle and deployment
  - common failure modes
- Strong analytical and troubleshooting skills in complex environments
- Experience with monitoring and observability tools
- Proficiency in:
  - SQL
  - Python or scripting (Bash)
- Ability to operate in a high-pressure, incident-driven environment while maintaining structure and clarity
- Strong stakeholder management and communication skills
Nice-to-have skills (preferred)
- Experience supporting AI/ML platforms at scale
- Familiarity with tools such as:
  - Databricks
  - MLflow
  - Grafana
  - Power BI
  - New Relic
- Exposure to model monitoring (drift, bias, performance validation)
- Experience working with external partners or vendors in delivery models
- Understanding of cloud platforms (AWS, GCP, Azure)
- Experience with containerized environments (Docker / Kubernetes)
- Background in building or scaling support functions from early-stage to maturity
- Strong service ownership mindset: takes accountability for outcomes, not just activity
- Calm, structured, and decisive during incidents
- Ability to balance operational delivery with strategic improvement
- Passion for building reliable, trustworthy AI/ML systems
- Highly collaborative across Engineering, Platform, and Delivery teams
- Focus on reducing risk related to:
  - model performance
  - bias
  - data integrity
- Commitment to documentation, knowledge sharing, and eliminating single points of failure
Benefits
Comp & perks
- At CloudFactory, we believe that work should be more than just a job; it should be a platform for growth, impact, and community. Here, you’ll earn with purpose, learn every day, and serve a mission that truly matters. If you're looking for a career where you can develop professionally, contribute meaningfully, and be part of a global movement, we’d love to have you on this journey!
- Join us today and be part of our mission to connect people and technology for a better world! Apply now and bring your whole, authentic self to work. We can’t wait to meet you!