Senior Technical Program Manager - DGX Cloud Storage

NVIDIA

full-time

Posted on: 8/20/2025

Origin: • 🇺🇸 United States • California


💰 $192,000 - $368,000 per year

Senior

AWS, Azure, Cloud, Google Cloud Platform

About the role

NVIDIA’s DGX Cloud is redefining how organizations deploy and scale AI infrastructure. The Senior TPM will drive storage-related initiatives across development, operations, and cloud deployment.
Lead cross-functional storage programs from requirements gathering through execution and delivery; drive alignment across NVIDIA storage engineering, operations, cloud service providers, cluster operators, resource governance, and finance.
Define project plans, schedules, and milestones for storage features, storage deployments, support, security, compliance, and observability.
Partner with the engineering team and product management to define and deliver the product roadmap.
Manage technical risks and resolve blockers that impact quality, scope, and delivery timelines.
Coordinate with cross-functional teams to improve workflows, efficiency, and transparency.
Ensure program visibility across the organization and maintain strong communication channels with senior stakeholders.
Improve organizational efficiency by collaborating with multi-functional leads and optimizing processes.
Cultivate a culture of continuous improvement, finding opportunities for process enhancements.

12+ years of experience in program management of large-scale software or infrastructure projects
MS EE or CS degree, or equivalent experience
Proven success driving programs across global, distributed teams.
Outstanding communication and organizational skills, with the ability to align cross-org stakeholders.
Expertise with tools like Jira and Confluence, and the ability to guide teams in their use.
Strong foundation in software development, Agile methodologies, and DevOps best practices.
Familiarity with Cloud Platforms: AWS, Azure, GCP, or OCI storage services (Block, Object, File)
Knowledge of Distributed Storage Systems: SAN, NAS, object storage, and scalable distributed architectures such as Ceph or Lustre.
Storage Performance: Understanding of IOPS, latency, throughput optimization, and capacity planning for large-scale environments
Data Protection & DR: Familiarity with snapshots, backups, replication, and disaster recovery strategies
AI/ML & HPC Workloads: Understanding of storage requirements for high-throughput AI training or data pipelines