
Senior Solutions Architect, CSP System
NVIDIA
full-time
Location Type: Office
Location: Shenzhen • China
About the role
- Work with the Sales, BD, and CPM teams to introduce NVIDIA technologies into assigned accounts and grow the business accordingly.
- Lead the design, development, and optimization of Kubernetes-based infrastructure solutions for Agentic AI and Agentic RL workloads, addressing core challenges including massive concurrent sandbox scheduling, millisecond-level elasticity, secure isolation, and full-scenario interactive environment support.
- Collaborate closely with NVIDIA’s CSP partners (major cloud service providers in China) to understand their Agentic AI/RL business needs, provide professional K8s technical guidance, and tailor infrastructure solutions that align with NVIDIA’s accelerated computing technologies (such as NVIDIA AI Enterprise, GB200 platform, and NVCF).
- Optimize Kubernetes clusters to support high-throughput, low-latency Agentic RL training and inference workloads, including resource scheduling strategy optimization, GPU resource management, network and storage performance tuning, and solving bottlenecks in large-scale Pod creation and scheduling.
- Design and implement core Agent Infra components on K8s, such as secure sandbox environments, interactive trajectory recording, checkpoint and breakpoint replay, and end-to-end observability tools, to support the full lifecycle of Agentic AI/RL development and deployment.
- Work with cross-functional teams (NVIDIA’s R&D, solution architecture, and technical support teams) to promote the integration of K8s with NVIDIA’s software and hardware ecosystem, including NVIDIA Operators, Dynamo, Grove, and KAI Scheduler, to achieve optimal performance of Agentic workloads.
- Provide technical leadership in K8s and Agentic AI/RL Infra fields, guide junior engineers, and drive the continuous iteration and improvement of infrastructure solutions based on industry best practices and customer feedback.
- Stay abreast of the latest trends in Kubernetes, Agentic AI, Agentic RL, and cloud-native infrastructure, introduce advanced technologies and solutions into NVIDIA’s CSP ecosystem, and promote technological innovation and standardization.
- Participate in technical pre-sales support, solution demonstrations, and technical training for CSP partners, helping them build and operate K8s-based Agentic AI/RL infrastructure.
Requirements
- Bachelor’s degree or above in Computer Science, Software Engineering, Electrical Engineering, or a related field; master’s degree is preferred.
- 10+ years of hands-on experience in Kubernetes development, operation, and optimization, with deep expertise in K8s core components (kube-apiserver, etcd, kube-scheduler, kubelet) and custom resource development (CRD/Operator).
- Proven experience in building and optimizing infrastructure for AI/ML workloads, with an in-depth understanding of Agentic AI and Agentic RL concepts; practical experience supporting Agentic RL training or inference workloads on K8s is a strong plus.
- Proficiency in containerization technologies (Docker, containerd), container network solutions (Calico, Cilium), and storage solutions (Ceph, GlusterFS), with experience in optimizing network and storage performance for high-concurrency AI workloads.
- Strong experience in GPU resource management on K8s, familiarity with NVIDIA GPU Operator, CUDA, and accelerated computing technologies, and the ability to optimize GPU utilization for Agentic AI/RL workloads.
- Excellent programming skills, with proficiency in at least one of Python, Go, or C++ and the ability to develop custom K8s controllers, plugins, or automation tools.
- Deep understanding of cloud-native architecture and best practices; experience working with major CSPs (Alibaba Cloud, Tencent Cloud, Huawei Cloud, etc.) is highly preferred.
- Fluent in spoken and written English, able to communicate effectively with global cross-functional teams and read technical documentation in English.
- Strong problem-solving skills, the ability to independently identify and resolve complex K8s and Agentic AI/RL Infra technical issues, and a proactive, results-driven work attitude.
Benefits
- Competitive salaries and a generous benefits package
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Kubernetes, AI/ML infrastructure optimization, containerization technologies, GPU resource management, programming (Python, Go, C++), custom resource development, network performance optimization, storage solutions (Ceph, GlusterFS), cloud-native architecture, technical leadership
Soft Skills
problem-solving, communication, collaboration, technical guidance, proactive attitude, result-driven mindset, leadership, cross-functional teamwork, customer feedback integration, training and demonstration