Tech Stack
AWS, Azure, Cloud, Flux, Google Cloud Platform, Grafana, Kubernetes, Postgres, Python, RabbitMQ, Redis
About the role
- Support and enhance our Kubernetes-based infrastructure in cloud environments, running both ML/LLM workloads and general applications
- Deploy and optimize LLM inference systems
- Design, build, and improve MLOps/DevOps pipelines to support the entire development lifecycle
- Manage GPU scheduling and autoscaling with Kubernetes-native tooling (see the sketch after this list)
- Ensure observability and alerting across the platform
- Operate and troubleshoot supporting infrastructure (databases, message brokers, caches)
- Contribute to platform reliability, security, and performance through automation and best practices
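For illustration only (not part of the posting): a minimal sketch of the GPU scheduling this role touches, using the official Kubernetes Python client. It assumes a cluster with the NVIDIA device plugin installed and a reachable kubeconfig; the pod name, image tag, and node label are hypothetical.

```python
# Minimal sketch: schedule a single-GPU pod with the Kubernetes Python client.
# Assumes the NVIDIA device plugin is installed; names and labels are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="llm-inference", labels={"app": "llm"}),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="server",
                image="vllm/vllm-openai:latest",  # illustrative image tag
                resources=client.V1ResourceRequirements(
                    # The device plugin exposes GPUs as an extended resource;
                    # requesting one makes the scheduler place the pod only
                    # on nodes with a free GPU.
                    limits={"nvidia.com/gpu": "1"},
                ),
            )
        ],
        # Optional: steer the pod to a dedicated GPU node pool (hypothetical label).
        node_selector={"gpu": "true"},
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```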
Requirements
- 5+ years of experience in MLOps or SRE
- Strong hands-on Kubernetes experience, including GitOps (Flux or ArgoCD), Kustomize, Helm, and production troubleshooting
- Familiarity with LLM inference deployment and optimization in Kubernetes (e.g., vLLM, LMCache, llm-d); a minimal vLLM sketch follows this list
- Experience with supporting MLOps tools such as MLflow or Argo Workflows
- Understanding of GPU resource orchestration in Kubernetes environments
- Deep working knowledge of observability tools such as VictoriaMetrics, VictoriaLogs, and Grafana
- Knowledge of database and broker administration (PostgreSQL, Redis, and RabbitMQ)
- Solid scripting skills in Python
- Comfortable working with cloud platforms (OVHcloud, AWS, GCP, or Azure)
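Again for illustration only: a minimal offline-inference smoke test with vLLM, one of the tools named above. The model name is a small placeholder, not a production model; real deployments would typically serve a model behind vLLM's OpenAI-compatible server instead.

```python
# Minimal vLLM smoke test; assumes a CUDA-capable GPU and `pip install vllm`.
# The model is a small placeholder chosen only so the script runs quickly.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.7, max_tokens=64)

# generate() batches prompts through vLLM's inference engine.
for out in llm.generate(["Kubernetes is"], params):
    print(out.outputs[0].text)
```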
What we offer
- Full ownership of a mission-critical platform
- A team that values curiosity, learning, and experimentation
- Remote-first setup with the option to work in our Berlin office
- Competitive salary depending on experience
- Work on AI infrastructure that directly impacts healthcare innovation
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
Kubernetes, MLOps, SRE, GitOps, Kustomize, Helm, LLM inference, Python, observability, GPU orchestration