Tech Stack
Distributed Systems, Go, Grafana, Kubernetes, Prometheus, Python, Terraform
About the role
We are seeking a highly skilled MLOps Engineer to design, build, and operate scalable machine learning infrastructure for modern AI applications. In this role, you will:
- Build and operate robust data, embedding, and prompt pipelines to support production AI/ML workloads.
- Maintain a secure and scalable system for managing AI agent identity, versioning, and registration.
- Deliver automated workflows for model deployment and infrastructure provisioning using modern DevOps tooling.
- Design and implement primitives for distributed coordination and orchestration of ML agents and services.
- Implement observability, monitoring, and guarded execution frameworks to ensure safe and reliable AI system behavior.
Requirements
- Strong experience with MLOps, DevOps, or SRE practices in production environments.
- Hands-on expertise with CI/CD pipelines and Infrastructure as Code (Terraform, Pulumi, etc.).
- Solid understanding of data engineering, feature/embedding pipelines, and ML model deployment.
- Familiarity with observability tooling (Prometheus, Grafana, ELK, OpenTelemetry, etc.).
- Experience with distributed systems and coordination mechanisms (e.g., Kubernetes, service meshes, message queues).
- Proficiency in one or more programming languages such as Python or Go.
- Bonus: knowledge of LLMOps, prompt engineering infrastructure, or agent frameworks.