Tech Stack
Linux, Node.js, Python, PyTorch
About the role
- Own end-to-end performance for distributed AI workloads across multi-node clusters
- Benchmark and tune open-source & industry workloads on current and future hardware
- Design and optimize distributed serving topologies and lead validation efforts
- Build crisp proof points comparing Cornelis Omni-Path to competing interconnects
- Instrument and visualize performance and evangelize best practices
Requirements
- B.S. in CS/EE/CE/Math or related
- 5–7+ years running AI/ML at cluster scale
- Proven ability to set up, run, and analyze AI benchmarks
- Hands-on with distributed training beyond single-GPU
- Practical experience across AI stacks & comms: PyTorch, DeepSpeed, Megatron-LM, etc.
- Comfortable with compilers and MPI stacks; Python + shell power user
- Familiarity with network architectures and Linux systems
- Excellent written and verbal communication
Benefits
- Competitive compensation package including equity, cash, and incentives
- Health and retirement benefits
- Generous paid holidays
- 401(k) with company match
- Open Time Off (OTO) for regular full-time exempt employees
- Sick time, bonding leave, and pregnancy disability leave