
HPC Specialist, Solutions Architect
Nebius Group
full-time
Posted on:
Location Type: Remote
Location: Netherlands
Visit company websiteExplore more
About the role
- Architect and implement scalable HPC clusters optimized for AI, simulation, and distributed training, leveraging container orchestration frameworks and schedulers (e.g., Kubernetes, Slurm).
- Design and integrate GPU-accelerated compute infrastructures featuring NVIDIA Hopper, Blackwell architectures, NVLink/NVSwitch, and InfiniBand/RoCE Interconnects.
- Deploy, and manage GPU Operator and Network Operator stacks for automated lifecycle management of GPU and high-speed networking components.
- Design and validate cloud HPC environments, focusing on low-latency, high-bandwidth networking, multi-GPU scaling, and efficient workload scheduling.
- Lead reference architectures for AI/ML model training, data pipelines, and MLOps integrations using modern observability and CI/CD tooling.
- Collaborate with hardware vendors (e.g., NVIDIA) and cloud providers to evaluate and optimize emerging HPC and GPU technologies.
- Benchmark system performance, identify bottlenecks, and tune resource utilization across compute, network, and storage tiers.
- Provide expert-level technical guidance to customers, internal teams, and partners on HPC architecture patterns, operational excellence reviews and customer engagements
Requirements
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field (Ph.D. a plus)
- 3+ years of hands-on experience architecting HPC or large-scale GPU clusters.
- Expertise in Linux systems, Kubernetes, container runtimes (containers, CRI-O, Docker), and related CI/CD practices.
- Strong understanding of HPC networking protocols and RDMA stacks (InfiniBand, NVLink/NVSwitch)
- Deep understanding of storage and I/O optimization for large datasets (Ceph, Lustre, NFS, GPUDirect Storage)
- Familiarity with Terraform, Ansible, Helm, and GitOps workflows.
- Strong scripting skills in Python or Bash for automation and tool integration.
- Excellent communication and documentation skills; ability to lead design reviews and customer engagements.
Benefits
- Competitive salary and comprehensive benefits package.
- Opportunities for professional growth within Nebius.
- Flexible working arrangements.
- A dynamic and collaborative work environment that values initiative and innovation.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
HPC architectureGPU clustersLinux systemsKubernetesCI/CD practicesHPC networking protocolsRDMA stacksstorage optimizationscripting in Pythonscripting in Bash
Soft Skills
communication skillsdocumentation skillsleadershipcollaborationtechnical guidance