
Senior Virtualization Validation Engineer
Crusoe
full-time
Location Type: Office
Location: San Francisco • California • United States
Salary
$172,500 - $210,000 per year
About the role
- Design and execute large-scale validation tests across multi-node virtualized clusters to ensure linear scaling and stability of GPU workloads.
- Validate high-speed interconnects—including NVLink, Infinity Fabric, InfiniBand, and RoCE—within virtualized environments to ensure low-latency, high-bandwidth communication.
- Lead the validation of QEMU and Cloud Hypervisor with a focus on PCIe passthrough (VFIO), IOMMU, and direct device assignment for GPUs and high-speed NICs.
- Architect and run comprehensive test suites using nccl-tests and rccl-tests (e.g., AllReduce, AllGather) to verify performance across node boundaries.
- Validate SR-IOV and RDMA configurations to ensure that virtualized guests achieve near-bare-metal networking performance for distributed GPU tasks.
- Develop and maintain automation frameworks in Python or Go to dynamically provision, configure, and stress-test multi-node virtualized environments.
- Perform deep-dive analysis of performance regressions in multi-node communication, identifying root causes across the guest OS, hypervisor, and physical fabric.
Requirements
- 2-5+ years of experience, with a demonstrated ability to perform the role's responsibilities competently and independently, plus a Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related technical field.
- Proven experience with QEMU/KVM and Cloud Hypervisor in a production or research environment.
- Deep familiarity with NVIDIA (CUDA/NCCL) and/or AMD (ROCm/RCCL) stacks in a multi-node context.
- Strong understanding of RDMA, RoCE, and InfiniBand protocols and their implementation in virtualized systems.
- Expert-level knowledge of Linux kernel internals, specifically PCIe topology, VFIO, and memory management (HugePages, IOMMU).
- Advanced proficiency in Python and/or Bash for automating complex cluster-wide test scenarios.
- Experience with MNNVL (Multi-Node NVLink) or specialized AI fabric architectures.
- Familiarity with hardware-level debugging tools and performance profilers (e.g., NVIDIA Nsight, AMD Omniperf).
- Knowledge of containerized orchestration for GPUs (e.g., Kubernetes with specialized device plugins).
Benefits
- Restricted Stock Units are included in all offers
Applicant Tracking System Keywords
Hard Skills & Tools
GPU workloads, QEMU, Cloud Hypervisor, PCIe passthrough, IOMMU, NVIDIA CUDA, NCCL, AMD ROCm, RDMA, Linux kernel internals