
Senior AI Workload Platform Engineer
Submer
full-time
Location Type: Remote
Location: Anywhere in Europe
About the role
- Design, build, and operate the compute orchestration layer powering a GPU-native cloud platform for AI and high-performance workloads.
- Maintain the existing CloudStack code base used in current production deployments.
- Integrate new upstream CloudStack releases into the internal platform fork.
- Perform upgrades of existing customer environments to newer CloudStack versions.
- Design and execute safe upgrade paths for running production environments.
- Troubleshoot orchestration and provisioning issues in existing deployments.
- Maintain and troubleshoot CloudStack VPC networking.
- Work with and understand CloudStack Debian VPC routers.
- Manage networking implementations based on Open vSwitch (OVS) and OVN.
- Improve the reliability of network orchestration components.
- Manage hypervisor implementations based on KVM and QEMU.
- Maintain and evolve the code responsible for QEMU GPU passthrough, including PCI mapping and exposure of L40S, RTX 6000 Pro, and H200 GPUs to virtual machines.
- Design orchestration and scheduling primitives for the next-generation platform based on Kubernetes, Slurm, and Argo Workflows.
- Build orchestration workflows that expose GPU and CPU compute resources to platform users.
- Implement Kubernetes scheduling strategies, including GPU partitioning, multi-GPU job placement, and topology-aware scheduling for distributed training and inference.
- Design and implement Kubernetes-based GPU/CPU scheduling infrastructure for multi-tenant AI workloads.
- Design and operate Slurm-based HPC scheduling environments integrated with Kubernetes clusters.
- Implement support for multi-node distributed GPU training and gang scheduling, and build automation for dynamic Slurm node registration.
- Design and implement workflow orchestration using Argo Workflows and develop reusable workflow templates for common platform workloads.
Requirements
- Proven experience working with large-scale distributed compute environments at a neo-cloud, hyperscaler, or HPC provider.
- Strong experience with CloudStack internals, including extending and maintaining platform functionality.
- Experience operating cloud orchestration platforms in production environments.
- Experience running GPU-heavy infrastructure for AI training, inference, or HPC workloads.
- Experience maintaining or extending large Java codebases, ideally within infrastructure platforms.
- Strong programming skills in Go and Python, with experience building cloud-native platform components.
- Deep practical knowledge of Kubernetes internals and Slurm scheduling systems.
- Familiarity with workflow orchestration systems such as Argo Workflows.
- Familiarity with virtual and distributed networking technologies such as OVS, OVN, VPC networking, RDMA, RoCE, ECMP, EVPN/VXLAN, and leaf-spine fabrics.
- Understanding of GPU virtualization and passthrough mechanisms such as QEMU PCI passthrough and NVIDIA MIG.
- Experience working with GPU infrastructure, including passthrough, NVIDIA MIG, scheduling, and lifecycle management of GPUs in distributed clusters.
- Ability to independently own major compute-orchestration initiatives from design through rollout and operational stabilization.
- Comfort with mentoring peers and improving implementation quality, documentation, operational workflows, and platform reliability within the compute orchestration domain.
Benefits
- Attractive compensation package reflecting your expertise and experience.
- A great work environment characterised by friendliness, international diversity, flexibility, and a hybrid-friendly approach.
- You'll be part of a fast-growing scale-up with a mission to make a positive impact, offering an exciting career evolution.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
CloudStack, Kubernetes, Slurm, Go, Python, QEMU, Open vSwitch, GPU virtualization, Java, Argo Workflows
Soft Skills
mentoring, independent ownership, operational stabilization, documentation improvement, quality improvement