Performance Engineer – AI Infrastructure

Andromeda

full-time

Posted on: 2/27/2026

Location Type: Remote

✨ AI Apply

About the role

Conduct end-to-end profiling of training workloads to identify bottlenecks across GPU kernels, NCCL communication, and storage I/O
Collaborate with systems engineers to improve scheduling efficiency, collective communication performance, and kernel execution
Build and maintain high-fidelity tooling to monitor and visualize MFU, throughput, and cluster uptime
Design technical processes that help the team operate effectively and avoid repeating performance regressions

Proven experience running distributed training jobs on multi-GPU systems or HPC clusters
Strong programming skills in Python and C++ (Rust or CUDA experience is a major plus)
Solid understanding of PyTorch, JAX, or TensorFlow, and large-scale training loops
Familiarity with modern cloud infrastructure, including Kubernetes and Infrastructure as Code
Passion for measuring efficiency rigorously and translating raw profiling data into practical engineering improvements.

Benefits

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

PythonC++RustCUDAPyTorchJAXTensorFlowdistributed trainingmulti-GPU systemsHPC clusters

Soft Skills

collaborationproblem-solvingefficiency measurementtechnical process design