Crusoe

Senior Virtualization Validation Engineer

Crusoe

full-time

Posted on:

Location Type: Office

Location: San FranciscoCaliforniaUnited States

Visit company website

Explore more

AI Apply
Apply

Salary

💰 $172,500 - $210,000 per year

Job Level

About the role

  • Design and execute large-scale validation tests across multi-node virtualized clusters to ensure linear scaling and stability of GPU workloads.
  • Validate high-speed interconnects—including NVLink, Infinity Fabric, InfiniBand, and RoCE—within virtualized environments to ensure low-latency, high-bandwidth communication.
  • Lead the validation of QEMU and Cloud Hypervisor with a focus on PCIe passthrough (VFIO), IOMMU, and direct device assignment for GPUs and high-speed NICs.
  • Architect and run comprehensive test suites using nccl-tests and rccl-tests (e.g., AllReduce, AllGather) to verify performance across node boundaries.
  • Validate SR-IOV and RDMA configurations to ensure that virtualized guests achieve near-bare-metal networking performance for distributed GPU tasks.
  • Develop and maintain automation frameworks in Python or Go to dynamically provision, configure, and stress-test multi-node virtualized environments.
  • Perform deep-dive analysis of performance regressions in multi-node communication, identifying root causes across the guest OS, hypervisor, and physical fabric.

Requirements

  • 2-5+ YOE demonstrated ability to competently and independently perform responsibilities plus Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related technical field.
  • Proven experience with QEMU/KVM and Cloud Hypervisor in a production or research environment.
  • Deep familiarity with NVIDIA (CUDA/NCCL) and/or AMD (ROCm/RCCL) stacks in a multi-node context.
  • Strong understanding of RDMA, RoCE, and InfiniBand protocols and their implementation in virtualized systems.
  • Expert-level knowledge of Linux kernel internals, specifically PCIe topology, VFIO, and memory management (HugePages, IOMMU).
  • Advanced proficiency in Python and/or Bash for automating complex cluster-wide test scenarios.
  • Experience with MNNVL (Multi-Node NVLink) or specialized AI fabric architectures.
  • Familiarity with hardware-level debugging tools and performance profilers (e.g., NVIDIA Nsight, AMD Omniperf).
  • Knowledge of containerized orchestration for GPUs (e.g., Kubernetes with specialized device plugins).
Benefits
  • Restricted Stock Units are included in all offers
Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
GPU workloadsQEMUCloud HypervisorPCIe passthroughIOMMUNVIDIA CUDANCCLAMD ROCmRDMALinux kernel internals