NVIDIA

Senior Product Manager - Observability and Resilience

full-time

Origin: 🇺🇸 United States • California

Salary

💰 $208,000 - $327,750 per year

Job Level

Senior

Tech Stack

Cloud, Distributed Systems, Docker, Grafana, Kubernetes, Prometheus, Splunk

About the role

  • Be a subject‑matter expert on resiliency and observability.
  • Deeply understand failure modes across the GPU hardware, network, and software stack, along with the telemetry signals that reveal them, and how they correlate to workload health and SLOs.
  • Master modern reliability architectures. Stay current with industry trends and build for everyone who wants to use them.
  • Drive joint project planning. Define concrete milestones, tasks, and work items for resiliency and observability initiatives with external partners.
  • Fuel innovation in reliability tooling. Lead ideation sessions to propose novel approaches and shape new proof‑of‑concepts.
  • Bridge development, SRE, and partner teams. Facilitate clear communication, triage emergent issues rapidly, and ensure feedback loops between engineering and customer operations remain tight.
  • Coordinate execution across different functions. Work with engineering, design, operations, sales, and marketing to embed resiliency and observability requirements into every product launch, capacity expansion, and lifecycle transition.

Requirements

  • BS or MS in Computer Science, Computer Engineering, or a related field (or equivalent experience) and 12+ years of product‑management experience in enterprise technology
  • Experience with GPU observability (DCGM, NVML, etc.) and integration into large‑scale telemetry systems
  • Deep knowledge of AI/ML infrastructure, HPC, networking, and cloud technologies (IaaS, PaaS) including containerization, Kubernetes, and automation tools
  • Familiarity with modern observability stacks: metrics, logs, traces, OpenTelemetry, Prometheus/Grafana, ELK/OpenSearch
  • Experience building, and preferably a deep understanding of, secure, compliance‑focused telemetry pipelines (SOC2, FedRAMP)
  • Ability to articulate trade‑offs among latency, throughput, cost, and reliability to both engineering and executive audiences
  • Data-driven approach: defines SLIs/SLOs, manages error budgets, and develops value models
  • Strong cross‑functional execution: writes clear specs and PRDs, produces GTM collateral, and leads agile processes
  • Master's/PhD or expertise in distributed systems, performance modeling, or fault‑tolerant computing
  • Experience with MLOps and LLMOps ecosystems and their integration with enterprise platforms, deployments at modern data‑center scale, and delivery of ML/AI observability solutions for LLMOps, predictive incident detection, or anomaly classification
  • Familiarity with MLOps toolchains and integrations with monitoring platforms such as Splunk, Datadog, and Grafana Cloud
  • Expertise with containerization and orchestration technologies such as Docker and Kubernetes, plus virtualization
  • Proficiency in network architecture and high‑performance interconnects (InfiniBand, Ethernet, RoCE)