Salary
💰 $168,000 - $310,500 per year
Tech Stack
LinuxPythonShell ScriptingUnix
About the role
- Design and implement novel test plans, tools, and automation frameworks to validate GPU functionality, performance, and reliability in complex datacenter environments
- Develop stress tests and methodologies to detect, characterize, and eliminate silent data errors
- Partner with architecture and silicon construction teams to influence system and chip-level features that improve diagnostics, debuggability, and root-cause analysis
- Analyze test results, investigate complex failures, and drive solutions with design, firmware, and software teams
- Provide technical leadership, guide junior engineers, and shape validation strategy across datacenter product lines
Requirements
- BS/MS in Electrical Engineering, Computer Engineering, Computer Science, or related field (or equivalent experience)
- 8+ years of experience in hardware validation, test development, or datacenter hardware engineering
- Expert programming skills in Python and/or C/C++ for automation and tool development
- Deep Linux/Unix expertise, including advanced shell scripting
- Strong knowledge of server architecture: CPUs, GPUs, PCIe, networking, and storage
- Hard-working, proactive approach with ability to own and deliver complex projects
- (Ways to stand out) Hands-on experience with NVIDIA GPU architecture (Hopper, Ampere) and software stack (CUDA, NCCL)
- Experience testing high-speed interconnects such as NVLink or InfiniBand
- Familiarity with AI/ML or HPC benchmarking and stress-testing tools
- Proven track record of identifying and resolving critical bugs in pre-production hardware