Submer

Senior AI Workload Platform Engineer

full-time

Location Type: Remote

Location: Anywhere in Europe

About the role

  • Design, build, and operate the compute orchestration layer powering a GPU-native cloud platform for AI and high-performance workloads.
  • Maintain the existing CloudStack code base used in current production deployments.
  • Integrate new upstream CloudStack releases into the internal platform fork.
  • Perform upgrades of existing customer environments to newer CloudStack versions.
  • Design and execute safe upgrade paths for running production environments.
  • Troubleshoot orchestration and provisioning issues in existing deployments.
  • Maintain and troubleshoot CloudStack VPC networking.
  • Work with and understand CloudStack Debian VPC routers.
  • Manage networking implementations based on Open vSwitch (OVS) and OVN.
  • Improve the reliability of network orchestration components.
  • Manage hypervisor implementations based on KVM and QEMU.
  • Maintain and evolve the code responsible for QEMU GPU passthrough, including PCI mapping and exposure of L40S, RTX 6000 Pro, and H200 GPUs to virtual machines.
  • Design orchestration and scheduling primitives for the next-generation platform based on Kubernetes, Slurm, and Argo Workflows.
  • Build orchestration workflows that expose GPU and CPU compute resources to platform users.
  • Implement Kubernetes scheduling strategies, including GPU partitioning, multi-GPU job placement, and topology-aware scheduling for distributed training and inference.
  • Design and implement Kubernetes-based GPU/CPU scheduling infrastructure for multi-tenant AI workloads.
  • Design and operate Slurm-based HPC scheduling environments integrated with Kubernetes clusters.
  • Implement support for multi-node distributed GPU training and gang scheduling, and build automation for dynamic Slurm node registration.
  • Design and implement workflow orchestration using Argo Workflows and develop reusable workflow templates for common platform workloads.

Requirements

  • Proven experience working with large-scale distributed compute environments at a neo-cloud, hyperscaler, or HPC provider.
  • Strong experience with CloudStack internals, including extending and maintaining platform functionality.
  • Experience operating cloud orchestration platforms in production environments.
  • Experience running GPU-heavy infrastructure for AI training, inference, or HPC workloads.
  • Experience maintaining or extending large Java codebases, ideally within infrastructure platforms.
  • Strong programming skills in Go and Python, with experience building cloud-native platform components.
  • Deep practical knowledge of Kubernetes internals and Slurm scheduling systems.
  • Familiarity with workflow orchestration systems such as Argo Workflows.
  • Familiarity with virtual networking and distributed networking technologies such as OVS, OVN, VPC networking, RDMA, RoCE, ECMP, EVPN/VXLAN, and leaf-spine fabrics.
  • Understanding of GPU virtualization and passthrough mechanisms such as QEMU PCI passthrough and NVIDIA MIG.
  • Experience working with GPU infrastructure, including passthrough, NVIDIA MIG, scheduling, and lifecycle management of GPUs in distributed clusters.
  • Ability to independently own major compute-orchestration initiatives from design through rollout and operational stabilization.
  • Comfortable mentoring peers and improving implementation quality, documentation, operational workflows, and platform reliability within the compute orchestration domain.

Benefits

  • Attractive compensation package reflecting your expertise and experience.
  • A great work environment characterised by friendliness, international diversity, flexibility, and a hybrid-friendly approach.
  • A place in a fast-growing scale-up with a mission to make a positive impact, and room for your career to grow.