NVIDIA

Senior Solutions Architect, Cloud Infrastructure and DevOps

Full-time

Origin: 🇦🇪 United Arab Emirates

Job Level

Senior

Tech Stack

Ansible, Chef, Cloud, DNS, Jenkins, Kubernetes, Linux, Puppet, Python, TCP/IP

About the role

  • Maintain large-scale HPC/AI clusters with monitoring, logging, and alerting
  • Manage Linux job/workload schedulers and orchestration tools
  • Develop and maintain continuous integration and delivery pipelines
  • Develop tooling to automate deployment and management of large-scale infrastructure environments, and enable self-service consumption of resources
  • Deploy monitoring solutions for servers, network and storage
  • Troubleshoot bottom-up, from bare metal through the operating system and software stack to the application level
  • Develop, redefine, and document standard methodologies to share with internal teams
  • Support Research & Development activities and engage in POCs/POVs for future improvements
  • Interact with customers, partners, and internal teams to analyze, define, and implement large-scale networking projects
  • Act as a technical resource and customer-facing representative

Requirements

  • BS/MS/PhD or equivalent experience in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or related fields
  • At least 8 years of professional experience in networking fundamentals, TCP/IP stack, and data center architecture
  • Knowledge of HPC and AI solution technologies, including CPUs, GPUs, high-speed interconnects, and supporting software
  • Extensive knowledge and hands-on experience with Kubernetes, including container orchestration for AI/ML workloads, resource scheduling, scaling, and integration with HPC environments
  • Experience in managing and installing HPC clusters, including deployment, optimization, and troubleshooting
  • Excellent knowledge of Linux systems (Red Hat/CentOS and Ubuntu), including internals, ACLs, OS-level security protections, and common protocols such as TCP, DHCP, and DNS
  • Experience with multiple storage solutions, including Lustre, GPFS, ZFS, and XFS
  • Proficiency in Python programming and Bash scripting
  • Comfortable with automation and configuration management tools, including Jenkins, Ansible, and Puppet/Chef
  • Excellent interpersonal skills and customer-facing experience
  • Familiarity with RDMA (InfiniBand or RoCE) fabrics (way to stand out)
  • Knowledge of CI/CD pipelines for software deployment and automation (way to stand out)
  • Experience with GPU-focused hardware/software (DGX, CUDA) (way to stand out)