
MLOps Engineer
Aerones
full-time
Location Type: Hybrid
Location: Riga • 🇱🇻 Latvia
Salary
💰 €2,500 - €5,500 per month
Job Level
Mid-Level • Senior
Tech Stack
Airflow, AWS, Azure, Cloud, Docker, Google Cloud Platform, Grafana, Kubernetes, Prometheus, Python, PyTorch, Ray, Terraform
About the role
- Own the end-to-end ML pipeline for computer vision: data prep, training, evaluation, model packaging, artifact/version management, deployment, and monitoring (local GPU cluster + GCP).
- Design and maintain containerized workflows for multi-GPU training and distributed workloads (e.g., PyTorch DDP, Ray, or similar); see the sketch after this list.
- Build and operate orchestration (e.g., Airflow/Argo/Kubeflow/Ray Jobs) for scheduled and on-demand pipelines across on-prem and cloud.
- Implement and tune resource allocation strategies based on current and upcoming task queues (GPU/CPU/memory-aware scheduling; preemption/priority; autoscaling).
- Introduce and integrate monitoring/telemetry for:
  - job health and failure analysis (retry, backoff, alerts),
  - data/feature drift and model performance (precision/recall, latency, throughput),
  - infra metrics (GPU utilization, memory, I/O, cost).
- Harden GCP environments (permissions, networks, registries, storage) and optimize for reliability, performance, and cost (spot/managed instance groups, autoscaling).
- Establish model governance: experiment tracking, model registry, promotion gates, rollbacks, and audit trails.
- Standardize CI/CD for ML (data/feature pipelines, model builds, tests, and canary/blue-green rollouts).
- Collaborate with CV researchers/engineers to productionize new models and improve training throughput & inference SLAs.
- Continuously improve documentation: update existing pipeline docs and produce concise runbooks, diagrams, and “how-to” guides.
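As a rough illustration of the multi-GPU training workflows mentioned above, here is a minimal PyTorch DDP entrypoint sketch. It assumes a single node launched with torchrun; the linear model and synthetic dataset are hypothetical placeholders standing in for a real computer-vision training job.

```python
# Minimal single-node PyTorch DDP training sketch (placeholder model/data).
# Assumes launch via torchrun, which sets RANK, LOCAL_RANK, and WORLD_SIZE.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def main():
    # One process per GPU; NCCL backend for GPU collectives
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(128, 10).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # Synthetic tensors stand in for the real CV dataset
    dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()  # gradients are all-reduced across GPUs here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Such a script would typically run inside the training container with something like `torchrun --nproc_per_node=4 train.py` (file name hypothetical).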
Requirements
- Hands-on MLOps experience building and running ML pipelines at scale (preferably computer vision) across on-prem GPUs and a public cloud (GCP preferred).
- Strong with Docker and Docker Compose in local and cloud environments; solid understanding of image build optimization and artifact caching.
- GitLab CI/CD expertise (modular templates, YAML optimization, build/test stages for ML, environment promotion).
- Proficiency with Python and Bash for pipeline tooling, glue code, and automation; Terraform for infra-as-code (GCP resources, IAM, networking, storage).
- Experience with orchestration: one or more of Airflow, Argo Workflows, Kubeflow, Ray, or Prefect.
- Experience operating GPU workloads: NVIDIA driver/CUDA stack, container runtimes, device plugins (k8s), multi-GPU training, utilization tuning.
- Observability & monitoring for ML and infra: Prometheus/Grafana, OpenTelemetry/Loki (or similar) for metrics, logs, traces; alerting and SLOs.
- Experiment tracking / model registry with tools like MLflow or Weights & Biases (runs, params, artifacts, metrics, registry/promotion); see the sketch after this list.
- Data versioning & validation: DVC/lakeFS (or similar), Great Expectations/whylogs, schema checks, drift detection.
- Cloud services: GCP (Compute Engine, GKE or Autopilot, Cloud Run, Artifact Registry, Cloud Storage, Pub/Sub). Equivalent AWS/Azure experience is acceptable.
- Security & compliance for ML stacks: secrets management, SBOM/image scanning, least-privilege IAM, network policies, artifact signing.
- Solid understanding of containerized deployment patterns (blue-green/canary), rollout strategies, and rollback safety.
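To illustrate the experiment-tracking requirement, a minimal MLflow sketch follows. It assumes MLflow is installed and a tracking URI is configured (otherwise runs land in a local ./mlruns directory); the experiment name, parameters, and metric values are hypothetical.

```python
# Minimal MLflow experiment-tracking sketch (hypothetical experiment name,
# hyperparameters, and metric values; real runs would log actual results).
import mlflow

mlflow.set_experiment("cv-defect-detection")  # assumed experiment name

with mlflow.start_run(run_name="baseline"):
    # Log hyperparameters once per run
    mlflow.log_params({"lr": 1e-3, "batch_size": 32, "epochs": 10})
    for epoch in range(10):
        # Placeholder metric values stand in for real evaluation output
        mlflow.log_metric("val_precision", 0.80 + 0.01 * epoch, step=epoch)
```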
Benefits
- Salary from **2,500 EUR to 5,500 EUR per month** (before taxes)
- A Birthday Gift
- **After Probationary Period:**
- **Health Insurance**
- **Health Recovery Days** (which can be taken as needed)
- Paid **Study Leave**
- Funding for the purchase of **Vision Glasses** after one (1) year of service
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
MLOps, ML pipelines, computer vision, Docker, GitLab CI/CD, Python, Bash, Terraform, orchestration, data versioning
Soft skills
collaboration, documentation, problem-solving, communication, organization