
Performance Engineer – AI Infrastructure
Andromeda
Full-time
Location Type: Remote
Location: California • United States
About the role
- Conduct end-to-end profiling of training workloads to identify bottlenecks across GPU kernels, NCCL communication, and storage I/O
- Collaborate with systems engineers to improve scheduling efficiency, collective communication performance, and kernel execution
- Build and maintain high-fidelity tooling to monitor and visualize model FLOPs utilization (MFU), throughput, and cluster uptime
- Design technical processes that help the team operate effectively and avoid repeating performance regressions
Requirements
- Proven experience running distributed training jobs on multi-GPU systems or HPC clusters
- Strong programming skills in Python and C++ (Rust or CUDA experience is a major plus)
- Solid understanding of PyTorch, JAX, or TensorFlow, and large-scale training loops
- Familiarity with modern cloud infrastructure, including Kubernetes and Infrastructure as Code
- Passion for measuring efficiency rigorously and translating raw profiling data into practical engineering improvements
Benefits
- Ownership and autonomy to shape how systems run
- A team that celebrates diversity and fosters an inclusive environment
Applicant Tracking System Keywords
Hard Skills & Tools
Python, C++, Rust, CUDA, PyTorch, JAX, TensorFlow, distributed training, multi-GPU systems, HPC clusters
Soft Skills
collaboration, problem-solving, efficiency measurement, technical process design