Machine Learning Operations Engineer
Nuvei
MLOps Engineer at Nuvei, designing reliable CI/CD and operating AI products for fraud scoring. Collaborates with data scientists and engineers to ensure operational excellence in ML systems.
Tech Stack
Tools & technologies
AWS, Azure, Cloud, Docker, Google Cloud Platform, Grafana, Kubernetes, Microservices, Prometheus, Python, Ray, Spark, Terraform, Unity
About the role
Key responsibilities & impact
- Operate and develop ML/LLM platforms on Kubernetes + cloud (Azure; AWS/GCP ok) with Docker, Terraform, and other relevant tools
- Manage object storage, GPUs, and autoscaling for training & low-latency model serving
- Manage cloud environment, networking, service mesh, secrets, and policies to meet PCI-DSS and data-residency requirements
- Build end-to-end CI/CD for models/agents/MCP tooling (versioning, tests, approvals)
- Deliver real-time fraud/risk scoring & agent signals under strict latency SLOs.
- Maintain MCP servers/clients: tool/resource definitions, versioning, quotas, isolation, access controls
- Integrate agents with microservices, event streams, and rule engines; provide SLAs, tracing, and on-call runbooks
- Measure operational metrics of ML/LLM (latency, throughput, cost, tokens, tool success, safety events)
- Enforce governance: RBAC/ABAC, row-level security, encryption, PII/secrets management, audit trails.
- Partner with DS on packaging (wheels/conda/containers), feature contracts, and reproducible experiments.
- Lead incident response and post-mortems.
- Drive FinOps: right-sizing, GPU utilization, batching/caching, budget alerts.
Requirements
What you’ll need
- 4+ years in DevOps/MLOps/Platform roles building and operating production ML systems (batch and real-time)
- Strong hands-on with Kubernetes, Docker, Terraform/IaC, and CI/CD
- Practical experience with Spark/Databricks and scalable data processing
- Proficiency in Python & Bash
- Ability to operate DS code and optimize runtime performance.
- Experience with model registries (MLflow or similar), experiment tracking, and artifact management.
- Production model serving using FastAPI/Ray Serve/Triton/TorchServe, including autoscaling and rollout strategies
- Monitoring and tracing with Prometheus/Grafana/OpenTelemetry; alerting tied to SLOs/SLAs
- Solid understanding of PCI-DSS/GDPR considerations for data and ML systems
- Experience with the Azure cloud environment is a big plus
- Operating LLM/agent workloads in production (prompt/config versioning, tool execution reliability, fallback/retry policies)
- Building/maintaining RAG stacks (indexing pipelines, vector DBs, retrieval evaluation, hybrid search)
- Implementing guardrails (policy checks, content filters, allow/deny lists) and human-in-the-loop workflows
- Experience with feature stores (Qwak Feature Store, Feast)
- A/B testing for models and agents, offline/online evaluation frameworks
- Payments/fraud/risk domain experience, including integrating ML outputs with rule engines and operational systems (an advantage)
- Familiarity with Databricks Unity Catalog, dbt, or similar tooling
Benefits
Comp & perks
- Private Medical Insurance
- Office and home hybrid working
- Global bonus plan
- Volunteering programs
- Prime location office close to Tel Aviv train station
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Kubernetes, Docker, Terraform, CI/CD, Python, Bash, Spark, Databricks, FastAPI, MLflow
Soft Skills
incident response, post-mortems, collaboration
Certifications
PCI-DSS, GDPR