Senior AI Workload Platform Engineer

Submer

Full-time

Posted on: 3/24/2026

Location Type: Remote

Location: Anywhere in Europe

Job Level

Senior

Tech Stack

Cloud, Go, Java, Kubernetes, Node.js, Python

About the role

Design, build, and operate the compute orchestration layer powering a GPU-native cloud platform for AI and high-performance workloads.
Maintain the existing CloudStack code base used in current production deployments.
Integrate new upstream CloudStack releases into the internal platform fork.
Perform upgrades of existing customer environments to newer CloudStack versions.
Design and execute safe upgrade paths for running production environments.
Troubleshoot orchestration and provisioning issues in existing deployments.
Maintain and troubleshoot CloudStack VPC networking.
Work with and understand CloudStack Debian VPC routers.
Manage networking implementations based on Open vSwitch (OVS) and OVN.
Improve the reliability of network orchestration components.
Manage hypervisor implementations based on KVM and QEMU.
Maintain and evolve the code responsible for QEMU GPU passthrough, including PCI mapping and exposure of L40S, RTX 6000 Pro, and H200 GPUs to virtual machines.
Design orchestration and scheduling primitives for the next-generation platform based on Kubernetes, Slurm, and Argo Workflows.
Build orchestration workflows that expose GPU and CPU compute resources to platform users.
Implement Kubernetes scheduling strategies, including GPU partitioning, multi-GPU job placement, and topology-aware scheduling for distributed training and inference.
Design and implement Kubernetes-based GPU/CPU scheduling infrastructure for multi-tenant AI workloads.
Design and operate Slurm-based HPC scheduling environments integrated with Kubernetes clusters.
Implement support for multi-node distributed GPU training and gang scheduling, and build automation for dynamic Slurm node registration.
Design and implement workflow orchestration using Argo Workflows and develop reusable workflow templates for common platform workloads.

Requirements

Proven experience working with large-scale distributed compute environments at a neo-cloud, hyperscaler, or HPC provider.
Strong experience with CloudStack internals, including extending and maintaining platform functionality.
Experience operating cloud orchestration platforms in production environments.
Experience running GPU-heavy infrastructure for AI training, inference, or HPC workloads.
Experience maintaining or extending large Java codebases, ideally within infrastructure platforms.
Strong programming skills in Go and Python, with experience building cloud-native platform components.
Deep practical knowledge of Kubernetes internals and Slurm scheduling systems.
Familiarity with workflow orchestration systems such as Argo Workflows.
Familiarity with virtual and distributed networking technologies such as OVS, OVN, VPC networking, RDMA, RoCE, ECMP, EVPN/VXLAN, and leaf-spine fabrics.
Understanding of GPU virtualization and passthrough mechanisms such as QEMU PCI passthrough and NVIDIA MIG.
Experience working with GPU infrastructure, including passthrough, NVIDIA MIG, scheduling, and lifecycle management of GPUs in distributed clusters.
Ability to independently own major compute-orchestration initiatives from design through rollout and operational stabilization.
Comfortable mentoring peers and improving implementation quality, documentation, operational workflows, and platform reliability within the compute orchestration domain.

Benefits

Attractive compensation package reflecting your expertise and experience.
A great work environment characterised by friendliness, international diversity, flexibility, and a hybrid-friendly approach.
You'll be part of a fast-growing scale-up with a mission to make a positive impact, offering an exciting career evolution.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

CloudStack, Kubernetes, Slurm, Go, Python, QEMU, Open vSwitch, GPU virtualization, Java, Argo Workflows

Soft Skills

Mentoring, independent ownership, operational stabilization, documentation improvement, quality improvement