
Senior Solutions Architect, Cloud Infrastructure, DevOps
NVIDIA
full-time
Location Type: Remote
Location: South Korea
About the role
- Maintain large-scale computational and AI infrastructure, with a focus on monitoring, logging, and workload orchestration (Kubernetes and Linux job schedulers).
- Perform end-to-end troubleshooting across the stack, from bare metal and operating system, through the software stack, container platform, networking, and storage.
- Optimize scalable, production-ready Kubernetes-based container platforms integrated with enterprise-grade networking and storage.
- Serve as a key technical resource, developing, refining, and documenting standard methodologies and operational guidelines to share with internal teams.
- Support Research & Development activities and engage in POCs/POVs to validate new features, architectures, and upgrade approaches.
- Create and deliver high-quality documentation, including runbooks, onboarding materials, and best-practice guides for customers and internal teams.
- Become the technical leader for assigned customer accounts, providing strategic guidance on DevOps and platform architecture and influencing long-term infrastructure and operations decisions.
Requirements
- BS/MS/PhD in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or a related field, with 8+ years of professional experience managing scalable cloud environments and working in automation engineering roles.
- Proven understanding of networking fundamentals (TCP/IP stack) and data center architectures, plus hands-on experience managing HPC/AI clusters, including deployment, optimization, and troubleshooting.
- Extensive experience with Kubernetes for container orchestration, resource scheduling, scaling, and integration with HPC environments.
- Familiarity with HPC and AI technologies (CPUs, GPUs, high-speed interconnects) and supporting software stacks.
- Deep knowledge of Linux (Red Hat/CentOS, Ubuntu), OS-level security, and network protocols (TCP, DHCP, DNS).
- Experience with storage solutions such as Lustre, GPFS, ZFS, XFS, and emerging Kubernetes storage technologies.
- Proficiency in Python and Bash scripting, configuration management, and Infrastructure-as-Code tools (e.g., Ansible, Terraform).
- Experience with observability stacks (Grafana, Loki, Prometheus) for monitoring, logging, and building fault-tolerant systems.
- Strong background in designing scalable solutions and providing consultative support to customers.
Applicant Tracking System Keywords
Hard skills
Kubernetes, Linux, Python, Bash scripting, Ansible, Terraform, HPC, AI technologies, networking fundamentals, storage solutions
Soft skills
technical leadership, strategic guidance, documentation, consultative support