Celestica

Lead Test Orchestrator, Management Engineer

full-time

Origin: 🇺🇸 United States

Job Level

Senior

Tech Stack

Ansible, Chef, Distributed Systems, gRPC, Kubernetes, Linux, Node.js, OpenStack, Puppet, Python, PyTorch, TensorFlow

About the role

  • Lead the strategy, definition, build-out, and operation of automated test infrastructure and methodologies for rack-based AI data center products.
  • Define, develop, and lead execution of a comprehensive test strategy covering functional, performance, scalability, reliability, security, and stress testing of control plane and data plane interactions.
  • Architect and implement test methodologies to validate orchestration, provisioning, monitoring, and management of complex multi-node, rack-based AI systems.
  • Design test environments to simulate large-scale data center operations and test resilience under load and failure scenarios.
  • Design, develop, and maintain scalable test automation frameworks, primarily in Python; build automated test suites that interact with APIs (REST, gNMI/gRPC), CLI, and UI.
  • Integrate automated tests into CI/CD pipelines for rapid feedback and release quality assurance.
  • Develop custom tools and harnesses to simulate managed devices, generate test data, and orchestrate complex test scenarios.
  • Conduct deep-dive performance analysis, benchmarking, and bottleneck identification across CPU, GPU, memory, PCIe, network fabric, storage I/O, and power delivery.
  • Analyze test data to identify trends, predict failures, and provide actionable insights for product improvements.
  • Provide expert troubleshooting and root cause analysis, collaborating with software development, SRE, and data center operations teams.
  • Participate in product design reviews and architectural discussions to ensure testability, reliability, and security from inception.

Requirements

  • Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related technical field.
  • 8+ years of progressive experience in system-level testing, validation, or QA for complex hardware/software integrated products, with a strong focus on data center or high-performance computing (HPC) environments.
  • Proven experience in a lead or senior technical role, including mentoring junior engineers and defining test strategies.
  • Expertise in developing robust test automation frameworks and scripts using Python.
  • Deep understanding of rack-level system architectures, including servers (x86/ARM), GPUs, high-speed networking (Ethernet, InfiniBand), and enterprise storage (NVMe).
  • Experience with performance benchmarking and tuning tools for CPU, GPU, network, and storage.
  • Proficiency in Linux operating systems, including system administration and debugging.
  • Strong analytical, problem-solving, and debugging skills for complex, distributed systems.
  • Excellent communication and collaboration skills to work across multi-disciplinary teams.
  • Preferred: Direct experience with AI/ML hardware platforms and their unique testing challenges.
  • Preferred: Familiarity with AI frameworks (e.g., TensorFlow, PyTorch, JAX) and their resource utilization patterns.
  • Preferred: Experience with orchestration tools (e.g., Kubernetes, Slurm, OpenStack, Ansible, Chef, Puppet) for managing compute resources.
  • Preferred: Knowledge of data center power and thermal management principles.