Tech Stack
Ansible, Chef, Distributed Systems, gRPC, Kubernetes, Linux, Node.js, OpenStack, Puppet, Python, PyTorch, TensorFlow
About the role
- Lead the strategy, definition, development, and operation of automated test infrastructure and methodologies for rack-based AI data center products.
- Define, develop, and lead execution of a comprehensive test strategy covering functional, performance, scalability, reliability, security, and stress testing of control plane and data plane interactions.
- Architect and implement test methodologies to validate orchestration, provisioning, monitoring, and management of complex multi-node, rack-based AI systems.
- Design test environments to simulate large-scale data center operations and test resilience under load and failure scenarios.
- Design, develop, and maintain scalable test automation frameworks, primarily in Python; build automated test suites that interact with APIs (REST, gNMI/gRPC), CLI, and UI.
- Integrate automated tests into CI/CD pipelines for rapid feedback and release quality assurance.
- Develop custom tools and harnesses to simulate managed devices, generate test data, and orchestrate complex test scenarios.
- Conduct deep-dive performance analysis, benchmarking, and bottleneck identification across CPU, GPU, memory, PCIe, network fabric, storage I/O, and power delivery.
- Analyze test data to identify trends, predict failures, and provide actionable insights for product improvements.
- Provide expert troubleshooting and root cause analysis, collaborating with software development, SRE, and data center operations teams.
- Participate in product design reviews and architectural discussions to ensure testability, reliability, and security from inception.
Requirements
- Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related technical field.
- 8+ years of progressive experience in system-level testing, validation, or QA for complex hardware/software integrated products, with a strong focus on data center or high-performance computing (HPC) environments.
- Proven experience in a lead or senior technical role, including mentoring junior engineers and defining test strategies.
- Expertise in developing robust test automation frameworks and scripts using Python.
- Deep understanding of rack-level system architectures, including servers (x86/ARM), GPUs, high-speed networking (Ethernet, InfiniBand), and enterprise storage (NVMe).
- Experience with performance benchmarking and tuning tools for CPU, GPU, network, and storage.
- Proficiency in Linux operating systems, including system administration and debugging.
- Strong analytical, problem-solving, and debugging skills for complex, distributed systems.
- Excellent communication and collaboration skills to work across multi-disciplinary teams.
- Preferred: Direct experience with AI/ML hardware platforms and their unique testing challenges.
- Preferred: Familiarity with AI frameworks (e.g., TensorFlow, PyTorch, JAX) and their resource utilization patterns.
- Preferred: Experience with orchestration tools (e.g., Kubernetes, Slurm, OpenStack, Ansible, Chef, Puppet) for managing compute resources.
- Preferred: Knowledge of data center power and thermal management principles.