NVIDIA

Senior Solutions Architect, Cloud Infrastructure, DevOps

full-time

Location Type: Remote

Location: South Korea

About the role

  • Maintain large-scale computational and AI infrastructure, focusing on monitoring, logging, and workload orchestration (Kubernetes and Linux job schedulers).
  • Perform end-to-end troubleshooting across the stack, from bare metal and the operating system through the software stack, container platform, networking, and storage.
  • Optimize scalable, production-ready Kubernetes-based container platforms integrated with enterprise-grade networking and storage.
  • Serve as a key technical resource by developing, refining, and documenting standard methodologies and operational guidelines to share with internal teams.
  • Support Research & Development activities and engage in POCs/POVs to validate new features, architectures, and upgrade approaches.
  • Create and deliver high-quality documentation, including runbooks, onboarding materials, and best-practice guides for customers and internal teams.
  • Become the technical leader for assigned customer accounts, providing strategic guidance on DevOps and platform architecture and influencing long-term infrastructure and operations decisions.

Requirements

  • BS/MS/PhD in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or a related field, with 8+ years of professional experience managing scalable cloud environments and working in automation engineering roles.
  • Proven understanding of networking fundamentals (TCP/IP stack) and data center architectures, plus hands-on experience managing HPC/AI clusters, including deployment, optimization, and troubleshooting.
  • Extensive experience with Kubernetes for container orchestration, resource scheduling, scaling, and integration with HPC environments.
  • Familiarity with HPC and AI technologies (CPUs, GPUs, high-speed interconnects) and supporting software stacks.
  • Deep knowledge of Linux (Red Hat/CentOS, Ubuntu), OS-level security, and network protocols (TCP, DHCP, DNS).
  • Experience with storage solutions such as Lustre, GPFS, ZFS, XFS, and emerging Kubernetes storage technologies.
  • Proficiency in Python and Bash scripting, configuration management, and Infrastructure-as-Code tools (e.g., Ansible, Terraform).
  • Experience with observability stacks (Grafana, Loki, Prometheus) for monitoring, logging, and building fault-tolerant systems.
  • Strong background in designing scalable solutions and providing consultative support to customers.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
Kubernetes, Linux, Python, Bash scripting, Ansible, Terraform, HPC, AI technologies, networking fundamentals, storage solutions
Soft skills
technical leadership, strategic guidance, documentation, consultative support