
MLOps Support Engineer
CloudFactory
full-time
Location Type: Hybrid
Location: Medellín • Colombia
About the role
- Provide Tier 1 / Tier 2 operational support for AI/ML solutions.
- Identify failed jobs, degraded pipelines, or performance anomalies.
- Triage incidents, investigate issues, and coordinate escalation to Tier 3 Engineering.
- Participate in on-call rotas once established.
- Validate that pipelines and jobs complete successfully.
- Monitor data pipeline health, model execution, and basic performance metrics.
- Identify operational issues before they impact customers.
- Respond to customers, or alert them proactively, when one of their models has an outage or issue.
- Support incident management, rollback, and recovery activities.
- Use and maintain runbooks and operational documentation.
- Work with Engineering to improve supportability and observability.
- Contribute to knowledge sharing to reduce single points of failure.
- Work within defined SLAs and support processes as the service matures.
- Prepare quarterly business reviews that report on the health of the ML models.
- Evaluate champion/challenger models to see if a new model should be promoted.
- Monitor for model drift and performance degradation, while validating that updates (new champion models or added data) do not introduce bias.
Requirements
- Experience in operations, DevOps, SRE, or platform support roles.
- Strong troubleshooting skills in production environments.
- Proficiency in SQL and scripting (Python, Bash) for developing and automating ML workflows.
- Familiarity with cloud platforms (AWS, GCP, Azure) and their ML services.
- Git: Solid understanding of version control, particularly in collaborative development environments.
- Comfortable working from runbooks and structured processes.
- Exposure to AI/ML systems in production.
- Familiarity with monitoring and observability tools (Grafana, Power BI, New Relic).
- Knowledge of MLOps tooling and data platforms (MLflow, Databricks).
- Experience supporting customer-facing platforms.
- Knowledge of containerization and orchestration (Kubernetes) is a plus.
- Experience with LLM prompt engineering and troubleshooting.
- Someone who is eager to learn about complex predictive models.
- Background in computer science, informatics, or related fields.
- Passion for Machine Learning and AI: An eager learner who is excited about working with cutting-edge ML technologies and is passionate about optimizing and maintaining ML models in production environments.
- Early Career in MLOps or ML Engineering: Ideally a junior ML engineer with a strong desire to grow in the field of MLOps and AI operations.
- A Collaborative Mindset: You thrive in a team setting and are ready to contribute to model improvement, A/B testing, and iterative development.
- Attention to Detail: A focus on model performance, bias prevention, and ensuring optimal model behavior as new data and models are introduced.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
SQL, Python, Bash, MLOps, model drift monitoring, performance metrics, incident management, troubleshooting, containerization, A/B testing
Soft skills
troubleshooting skills, collaborative mindset, attention to detail, eager learner, knowledge sharing