NTT

HPC/AI Infrastructure Architect

full-time

Location Type: Remote

Location: Remote • 🇬🇧 United Kingdom

Job Level

Senior, Lead

Tech Stack

Docker, Kubernetes, Node.js, OpenShift

About the role

  • Design GPU cluster architectures tailored for AI and HPC workloads.
  • Define node configurations for diverse workload types including dense GPU nodes, cost-optimized nodes, and high-memory CPU nodes.
  • Specify and validate performance metrics including compute throughput, memory bandwidth, and power consumption.
  • Architect multi-tier interconnect networks using NVLink, InfiniBand, and high-speed Ethernet.
  • Develop topology designs and calculate bandwidth and latency targets.
  • Model performance for customer workloads and validate against industry benchmarks.
  • Lead technical discussions with customer architects and stakeholders.
  • Conduct workload sizing and architectural presentations.
  • Develop technical content for proposals, including bills of materials (BoMs), compliance matrices, and alignment with scoring criteria.
  • Analyze competitor solutions and articulate technical differentiators.
  • Design and expand lab infrastructure for AI workload testing and validation.
  • Build reference architectures across industries such as finance, manufacturing, healthcare, and research.
  • Support lab operations including cluster configuration, workload orchestration, and software stack maintenance.
  • Deploy and showcase customer-specific AI workloads including LLM training, computer vision, and scientific simulations.
  • Manage proof-of-concept projects, define success criteria, and present outcomes to stakeholders.
  • Maintain relationships with key technology vendors and participate in early access programs.
  • Evaluate emerging technologies and contribute to innovation roadmaps and adoption strategies.

Requirements

  • 8+ years in HPC/AI infrastructure design
  • 5+ years working with GPU-accelerated systems
  • Proven experience with large-scale GPU deployments (1000+ GPUs)
  • Successful track record in technical bid support and customer engagement

Technical Competencies

  • GPU Architectures: NVIDIA (H100, H200, B100, B200), AMD (MI300X), Intel (Gaudi2/3)
  • Interconnects: InfiniBand (HDR/NDR/XDR), NVLink, RoCE, Infinity Fabric
  • Storage Systems: Lustre, GPFS, BeeGFS, NVMe-oF, S3-compatible object storage
  • Container Platforms: Kubernetes, Docker, Singularity/Apptainer
  • Performance Tools: NVIDIA Nsight, ROCm, Intel VTune

Certifications (Preferred)

  • NVIDIA Deep Learning Institute (DLI)
  • Red Hat Certified Specialist in OpenShift
  • InfiniBand Certified Professional

Benefits

  • Opportunity to work on cutting-edge AI infrastructure projects
  • Collaborative and innovative work environment
  • Access to advanced lab infrastructure and vendor technologies
  • Career development through technical leadership and innovation

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
GPU cluster architectures, HPC workloads, performance metrics, compute throughput, memory bandwidth, power consumption, topology designs, bandwidth targets, latency targets, large-scale GPU deployments
Soft skills
technical discussions, customer engagement, architectural presentations, relationship management, innovation roadmaps
Certifications
NVIDIA Deep Learning Institute (DLI), Red Hat Certified Specialist in OpenShift, InfiniBand Certified Professional