Nebius Group

HPC Specialist, Solutions Architect

full-time

Location Type: Remote

Location: Netherlands

About the role

  • Architect and implement scalable HPC clusters optimized for AI, simulation, and distributed training, leveraging container orchestration frameworks and schedulers (e.g., Kubernetes, Slurm).
  • Design and integrate GPU-accelerated compute infrastructures featuring NVIDIA Hopper and Blackwell architectures, NVLink/NVSwitch, and InfiniBand/RoCE interconnects.
  • Deploy and manage GPU Operator and Network Operator stacks for automated lifecycle management of GPU and high-speed networking components.
  • Design and validate cloud HPC environments, focusing on low-latency, high-bandwidth networking, multi-GPU scaling, and efficient workload scheduling.
  • Lead reference architectures for AI/ML model training, data pipelines, and MLOps integrations using modern observability and CI/CD tooling.
  • Collaborate with hardware vendors (e.g., NVIDIA) and cloud providers to evaluate and optimize emerging HPC and GPU technologies.
  • Benchmark system performance, identify bottlenecks, and tune resource utilization across compute, network, and storage tiers.
  • Provide expert-level technical guidance to customers, internal teams, and partners on HPC architecture patterns, operational excellence reviews, and customer engagements.

Requirements

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field (Ph.D. a plus).
  • 3+ years of hands-on experience architecting HPC or large-scale GPU clusters.
  • Expertise in Linux systems, Kubernetes, container runtimes (containerd, CRI-O, Docker), and related CI/CD practices.
  • Strong understanding of HPC networking protocols and RDMA stacks (InfiniBand, NVLink/NVSwitch).
  • Deep understanding of storage and I/O optimization for large datasets (Ceph, Lustre, NFS, GPUDirect Storage).
  • Familiarity with Terraform, Ansible, Helm, and GitOps workflows.
  • Strong scripting skills in Python or Bash for automation and tool integration.
  • Excellent communication and documentation skills; ability to lead design reviews and customer engagements.

Benefits

  • Competitive salary and comprehensive benefits package.
  • Opportunities for professional growth within Nebius.
  • Flexible working arrangements.
  • A dynamic and collaborative work environment that values initiative and innovation.