Andromeda

Performance Engineer – AI Infrastructure

Full-time

Location Type: Remote

Location: California, United States

About the role

  • Conduct end-to-end profiling of training workloads to identify bottlenecks across GPU kernels, NCCL communication, and storage I/O
  • Collaborate with systems engineers to improve scheduling efficiency, collective communication performance, and kernel execution
  • Build and maintain high-fidelity tooling to monitor and visualize MFU, throughput, and cluster uptime
  • Design technical processes that help the team operate effectively and avoid repeating performance regressions

Requirements

  • Proven experience running distributed training jobs on multi-GPU systems or HPC clusters
  • Strong programming skills in Python and C++ (Rust or CUDA experience is a major plus)
  • Solid understanding of PyTorch, JAX, or TensorFlow, and large-scale training loops
  • Familiarity with modern cloud infrastructure, including Kubernetes and Infrastructure as Code
  • Passion for measuring efficiency rigorously and translating raw profiling data into practical engineering improvements

Benefits

  • Ownership and autonomy to shape how systems run
  • A culture that celebrates diversity and fosters an inclusive environment