Lead Test Orchestrator, Management Engineer

Celestica

full-time

Posted on: 9/25/2025

Origin: 🇺🇸 United States

Job Level

Senior

Tech Stack

Ansible, Chef, Distributed Systems, gRPC, Kubernetes, Linux, Node.js, OpenStack, Puppet, Python, PyTorch, TensorFlow

About the role

Lead the strategy, definition, construction, and operation of automated test infrastructure and methodologies for rack-based AI data center products.
Define, develop, and lead execution of a comprehensive test strategy covering functional, performance, scalability, reliability, security, and stress testing of control plane and data plane interactions.
Architect and implement test methodologies to validate orchestration, provisioning, monitoring, and management of complex multi-node, rack-based AI systems.
Design test environments to simulate large-scale data center operations and test resilience under load and failure scenarios.
Design, develop, and maintain scalable test automation frameworks, primarily in Python; build automated test suites that interact with APIs (REST, gNMI/gRPC), CLI, and UI.
Integrate automated tests into CI/CD pipelines for rapid feedback and release quality assurance.
Develop custom tools and harnesses to simulate managed devices, generate test data, and orchestrate complex test scenarios.
Conduct deep-dive performance analysis, benchmarking, and bottleneck identification across CPU, GPU, memory, PCIe, network fabric, storage I/O, and power delivery.
Analyze test data to identify trends, predict failures, and provide actionable insights for product improvements.
Provide expert troubleshooting and root cause analysis, collaborating with software development, SRE, and data center operations teams.
Participate in product design reviews and architectural discussions to ensure testability, reliability, and security from inception.

Requirements

Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related technical field.
8+ years of progressive experience in system-level testing, validation, or QA for complex hardware/software integrated products, with a strong focus on data center or high-performance computing (HPC) environments.
Proven experience in a lead or senior technical role, including mentoring junior engineers and defining test strategies.
Expertise in developing robust test automation frameworks and scripts using Python.
Deep understanding of rack-level system architectures, including servers (x86/ARM), GPUs, high-speed networking (Ethernet, InfiniBand), and enterprise storage (NVMe).
Experience with performance benchmarking and tuning tools for CPU, GPU, network, and storage.
Proficiency in Linux operating systems, including system administration and debugging.
Strong analytical, problem-solving, and debugging skills for complex, distributed systems.
Excellent communication and collaboration skills to work across multi-disciplinary teams.
Preferred: Direct experience with AI/ML hardware platforms and their unique testing challenges.
Preferred: Familiarity with AI frameworks (e.g., TensorFlow, PyTorch, JAX) and their resource utilization patterns.
Preferred: Experience with orchestration tools (e.g., Kubernetes, Slurm, OpenStack, Ansible, Chef, Puppet) for managing compute resources.
Preferred: Knowledge of data center power and thermal management principles.
