Design and execute large-scale validation tests across multi-node virtualized clusters to ensure linear scaling and stability of GPU workloads.
Validate high-speed interconnects—including NVLink, Infinity Fabric, InfiniBand, and RoCE—within virtualized environments to ensure low-latency, high-bandwidth communication.
Lead the validation of QEMU and Cloud Hypervisor with a focus on PCIe passthrough (VFIO), IOMMU, and direct device assignment for GPUs and high-speed NICs.
Architect and run comprehensive test suites using nccl-tests and rccl-tests (e.g., AllReduce, AllGather) to verify performance across node boundaries.
Validate SR-IOV and RDMA configurations to ensure that virtualized guests achieve near-bare-metal networking performance for distributed GPU tasks.
Develop and maintain automation frameworks in Python or Go to dynamically provision, configure, and stress-test multi-node virtualized environments.
Perform deep-dive analysis of performance regressions in multi-node communication, identifying root causes across the guest OS, hypervisor, and physical fabric.

Requirements

2-5+ YOE demonstrated ability to competently and independently perform responsibilities plus Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related technical field.
Proven experience with QEMU/KVM and Cloud Hypervisor in a production or research environment.
Deep familiarity with NVIDIA (CUDA/NCCL) and/or AMD (ROCm/RCCL) stacks in a multi-node context.
Strong understanding of RDMA, RoCE, and InfiniBand protocols and their implementation in virtualized systems.
Expert-level knowledge of Linux kernel internals, specifically PCIe topology, VFIO, and memory management (HugePages, IOMMU).
Advanced proficiency in Python and/or Bash for automating complex cluster-wide test scenarios.
Experience with MNNVL (Multi-Node NVLink) or specialized AI fabric architectures.
Familiarity with hardware-level debugging tools and performance profilers (e.g., NVIDIA Nsight, AMD Omniperf).
Knowledge of containerized orchestration for GPUs (e.g., Kubernetes with specialized device plugins).

Benefits

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

GPU workloadsQEMUCloud HypervisorPCIe passthroughIOMMUNVIDIA CUDANCCLAMD ROCmRDMALinux kernel internals