Nebius Group

Senior HPC Cluster Engineer

Full-time

Location Type: Remote

Location: United States

About the role

  • Tuning the performance of GPU clusters and InfiniBand networks to ensure optimal operation in HPC and GPU-based environments.
  • Troubleshooting issues related to GPUs and InfiniBand networks, identifying their root causes, and proposing corrective actions.
  • Integrating new hardware into the existing infrastructure, including support for new GPU hardware through software stacks like Kubernetes, QEMU, and KVM.
  • Enhancing automation systems that proactively monitor, detect, and resolve issues in GPU and InfiniBand environments.
  • Configuring and managing GPU devices and InfiniBand fabrics, ensuring efficient and reliable operation.

Requirements

  • 5+ years of professional experience in system-level software development, with a focus on performance optimization and low-level programming.
  • 3+ years of hands-on experience with Linux systems (administration, troubleshooting, and performance tuning).
  • In-depth understanding of server architecture, including PCIe devices, NICs, Linux OS/Kernel, and high-performance computing (HPC) systems.
  • Strong proficiency in one or more performance-oriented programming languages (C/C++, Go, Python).

Benefits

  • Competitive salary and comprehensive benefits package.
  • Opportunities for professional growth within Nebius.
  • Flexible working arrangements.
  • A dynamic and collaborative work environment that values initiative and innovation.