
HPC Specialist, Solutions Architect
Nebius Group
full-time
Posted on:
Location Type: Remote
Location: United States
Visit company websiteExplore more
Salary
💰 $225,000 - $315,000 per year
About the role
- Architect and implement scalable HPC clusters optimized for AI, simulation, and distributed training, leveraging container orchestration frameworks and schedulers (e.g., Kubernetes, Slurm).
- Design and integrate GPU-accelerated compute infrastructures featuring NVIDIA Hopper, Blackwell architectures, NVLink/NVSwitch, and InfiniBand/RoCE Interconnects.
- Deploy, and manage GPU Operator and Network Operator stacks for automated lifecycle management of GPU and high-speed networking components.
- Design and validate cloud HPC environments, focusing on low-latency, high-bandwidth networking, multi-GPU scaling, and efficient workload scheduling.
- Lead reference architectures for AI/ML model training, data pipelines, and MLOps integrations using modern observability and CI/CD tooling.
- Collaborate with hardware vendors (e.g., NVIDIA) and cloud providers to evaluate and optimize emerging HPC and GPU technologies.
- Benchmark system performance, identify bottlenecks, and tune resource utilization across compute, network, and storage tiers.
- Provide expert-level technical guidance to customers, internal teams, and partners on HPC architecture patterns, operational excellence reviews and customer engagements
Requirements
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field (Ph.D. a plus)
- 3+ years of hands-on experience architecting HPC or large-scale GPU clusters.
- Expertise in Linux systems, Kubernetes, container runtimes (containers, CRI-O, Docker), and related CI/CD practices.
- Strong understanding of HPC networking protocols and RDMA stacks (InfiniBand, NVLink/NVSwitch)
- Deep understanding of storage and I/O optimization for large datasets (Ceph, Lustre, NFS, GPUDirect Storage)
- Familiarity with Terraform, Ansible, Helm, and GitOps workflows.
- Strong scripting skills in Python or Bash for automation and tool integration.
- Excellent communication and documentation skills; ability to lead design reviews and customer engagements.
Benefits
- Health Insurance: 100% company-paid medical, dental, and vision coverage for employees and families.
- 401(k) Plan: Up to 4% company match with immediate vesting.
- Parental Leave: 20 weeks paid for primary caregivers, 12 weeks for secondary caregivers.
- Remote Work Reimbursement: Up to $85/month for mobile and internet.
- Disability & Life Insurance: Company-paid short-term, long-term, and life insurance coverage.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
HPC architectureGPU clustersLinux systemsKubernetesCI/CD practicesHPC networking protocolsRDMA stacksstorage optimizationscripting in Pythonscripting in Bash
Soft Skills
communication skillsdocumentation skillsleadershipcollaboration
Certifications
Bachelor's degreeMaster's degreePh.D.