DDN

Senior Benchmark and Performance Engineer – AI and Storage Systems

full-time

Origin: 🇺🇸 United States

Job Level

Senior

Tech Stack

Cloud, Grafana, Kubernetes, Linux, Node.js, Prometheus, Python, PyTorch, TensorFlow

About the role

  • Design and execute performance benchmarks across AI, HPC, and storage platforms
  • Run and tune AI inference workloads using frameworks such as PyTorch, TensorFlow, Triton, NVIDIA NIMs, and vector databases
  • Benchmark large-scale RAG pipelines including data ingestion, retrieval, and inference performance
  • Profile and optimize MPI and multi-node distributed applications
  • Compile and debug C/C++, Python, and CUDA-based codes across heterogeneous systems
  • Generate automated test scripts and benchmarking workflows (e.g., with Bash, Python, or Slurm job scripts)
  • Analyze and visualize results using Excel, Jupyter, or reporting tools; create comparison graphs and KPIs
  • Write clear, concise performance reports for both technical and non-technical stakeholders
  • Present findings internally and externally, translating results into architectural guidance for field engineers and sales teams
  • Collaborate with system engineers, product managers, and partners to tune and improve software/hardware stack performance
  • Validate and tune performance on storage systems including parallel file systems (e.g., Lustre, GPFS), object storage, and NVMe over Fabrics
  • Contribute to internal tooling to automate test cycles and performance regression tracking

Requirements

  • 7+ years of experience in performance engineering, benchmarking, or HPC/AI systems
  • Deep experience with AI/ML and deep learning frameworks (PyTorch, TensorFlow, ONNX, Triton)
  • Familiarity with NVIDIA NIMs and containerized model serving stacks
  • Proven expertise with MPI, OpenMP, Slurm or similar schedulers in large-scale compute environments
  • Solid understanding of file and storage systems (e.g., POSIX, Lustre, S3, NVMe-oF)
  • Strong Linux skills (debugging, tuning, networking, storage stack)
  • Proficiency in scripting (e.g., Bash, Python) for job orchestration and result parsing
  • Ability to create clear Excel graphs and presentations from raw benchmark data
  • Strong communication skills, including the ability to convey technical results and trade-offs to engineering and customer-facing teams
  • Preferred: Experience with RAG pipelines, vector databases (e.g., FAISS, Milvus, Qdrant)
  • Familiarity with Kubernetes and CSI-based persistent volume systems
  • Understanding of GPU profiling tools (Nsight, nvprof, PyTorch Profiler)
  • Knowledge of telemetry and monitoring frameworks (e.g., Prometheus, Grafana)
  • Prior work publishing or presenting technical performance results