
HPC/AI Infrastructure Architect
NTT
full-time
Location Type: Remote
Location: Remote • 🇬🇧 United Kingdom
Job Level
Senior, Lead
Tech Stack
Docker, Kubernetes, Node.js, OpenShift
About the role
- Design GPU cluster architectures tailored for AI and HPC workloads.
- Define node configurations for diverse workload types including dense GPU nodes, cost-optimized nodes, and high-memory CPU nodes.
- Specify and validate performance metrics including compute throughput, memory bandwidth, and power consumption.
- Architect multi-tier interconnect networks using NVLink, InfiniBand, and high-speed Ethernet.
- Develop topology designs and calculate bandwidth and latency targets.
- Model performance for customer workloads and validate against industry benchmarks.
- Lead technical discussions with customer architects and stakeholders.
- Conduct workload sizing and architectural presentations.
- Develop technical content for proposals including BoMs, compliance matrices, and scoring alignment.
- Analyze competitor solutions and articulate technical differentiators.
- Design and expand lab infrastructure for AI workload testing and validation.
- Build reference architectures across industries such as finance, manufacturing, healthcare, and research.
- Support lab operations including cluster configuration, workload orchestration, and software stack maintenance.
- Deploy and showcase customer-specific AI workloads including LLM training, computer vision, and scientific simulations.
- Manage proof-of-concept projects, define success criteria, and present outcomes to stakeholders.
- Maintain relationships with key technology vendors and participate in early access programs.
- Evaluate emerging technologies and contribute to innovation roadmaps and adoption strategies.
Requirements
- 8+ years in HPC/AI infrastructure design
- 5+ years working with GPU-accelerated systems
- Proven experience with large-scale GPU deployments (1000+ GPUs)
- Successful track record in technical bid support and customer engagement
Technical Competencies
- GPU Architectures: NVIDIA (H100, H200, B100, B200), AMD (MI300X), Intel (Gaudi2/3)
- Interconnects: InfiniBand (HDR/NDR/XDR), NVLink, RoCE, Infinity Fabric
- Storage Systems: Lustre, GPFS, BeeGFS, NVMe-oF, S3-compatible object storage
- Container Platforms: Kubernetes, Docker, Singularity/Apptainer
- Performance Tools: NVIDIA Nsight, ROCm, Intel VTune
Certifications (Preferred)
- NVIDIA Deep Learning Institute (DLI)
- Red Hat Certified Specialist in OpenShift
- InfiniBand Certified Professional
Benefits
- Opportunity to work on cutting-edge AI infrastructure projects
- Collaborative and innovative work environment
- Access to advanced lab infrastructure and vendor technologies
- Career development through technical leadership and innovation
Applicant Tracking System Keywords
Hard skills
GPU cluster architectures, HPC workloads, performance metrics, compute throughput, memory bandwidth, power consumption, topology designs, bandwidth targets, latency targets, large-scale GPU deployments
Soft skills
technical discussions, customer engagement, architectural presentations, relationship management, innovation roadmaps
Certifications
NVIDIA Deep Learning Institute (DLI), Red Hat Certified Specialist in OpenShift, InfiniBand Certified Professional