CloudFactory

MLOps Support Engineer

full-time

Location Type: Hybrid

Location: Medellín, Colombia

About the role

  • Provide Tier 1 / Tier 2 operational support for AI/ML solutions.
  • Identify failed jobs, degraded pipelines, or performance anomalies.
  • Triage incidents, investigate issues, and coordinate escalation to Tier 3 Engineering.
  • Participate in on-call rotas once established.
  • Validate that pipelines and jobs complete successfully.
  • Monitor data pipeline health, model execution, and basic performance metrics.
  • Identify operational issues before they impact customers.
  • Respond to customers, or alert them proactively, when one of their models has an outage or issue.
  • Support incident management, rollback, and recovery activities.
  • Use and maintain runbooks and operational documentation.
  • Work with Engineering to improve supportability and observability.
  • Contribute to knowledge sharing to reduce single points of failure.
  • Work within defined SLAs and support processes as the service matures.
  • Prepare quarterly business reviews that report on the health of the ML models.
  • Evaluate champion/challenger models to determine whether a challenger should be promoted.
  • Monitor for model drift and performance degradation, while validating that updates (new champion models or added data) do not introduce bias.

Requirements

  • Experience in operations, DevOps, SRE, or platform support roles.
  • Strong troubleshooting skills in production environments.
  • Proficiency in SQL and scripting (Python, Bash) for developing and automating ML workflows.
  • Familiarity with cloud platforms (AWS, GCP, Azure) and their hosted ML services.
  • Git: Solid understanding of version control, particularly in collaborative development environments.
  • Comfortable working from runbooks and structured processes.
  • Exposure to AI/ML systems in production.
  • Familiarity with monitoring and observability tools (Grafana, Power BI, New Relic).
  • Knowledge of MLOps tooling and data platforms (MLflow, Databricks).
  • Experience supporting customer-facing platforms.
  • Knowledge of containerization and container orchestration (Kubernetes) is a plus.
  • Experience with LLM prompt engineering and troubleshooting.
  • Early career in MLOps or ML Engineering.
  • Someone who is eager to learn about complex predictive models.
  • Background in computer science, informatics, or a related field.
  • Passion for Machine Learning and AI: An eager learner who is excited about working with cutting-edge ML technologies and is passionate about optimizing and maintaining ML models in production environments.
  • Early Career in MLOps or ML Engineering: Ideally a junior ML engineer with a strong desire to grow in MLOps and AI operations.
  • A Collaborative Mindset: You thrive in a team setting and are ready to contribute to model improvement, A/B testing, and iterative development.
  • Attention to Detail: A focus on model performance, bias prevention, and ensuring optimal model behavior as new data and models are introduced.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
SQL, Python, Bash, MLOps, model drift monitoring, performance metrics, incident management, troubleshooting, containerization, A/B testing
Soft skills
troubleshooting skills, collaborative mindset, attention to detail, eager learner, knowledge sharing