Senior Product Manager - Observability and Resilience

NVIDIA

full-time

Posted on: 8/18/2025

Origin: 🇺🇸 United States • California

Salary

💰 $208,000 - $327,750 per year

Job Level

Senior

Tech Stack

Cloud, Distributed Systems, Docker, Grafana, Kubernetes, Prometheus, Splunk

About the role

Be a subject‑matter expert on resiliency and observability.
Deeply understand failure modes across the GPU hardware, network, and software stack, along with the telemetry signals that reveal them, and how they correlate to workload health and SLOs.
Master modern reliability architectures. Keep up to date with industry trends, and build for everyone who wants to use them.
Drive joint project planning. Define concrete milestones, tasks, and deliverables for resiliency and observability initiatives with external partners.
Fuel innovation in reliability tooling. Lead ideation sessions to propose novel approaches and shape new proof‑of‑concepts.
Bridge development, SRE, and partner teams. Facilitate clear communication, triage emergent issues rapidly, and ensure feedback loops between engineering and customer operations remain tight.
Coordinate execution across different functions. Work with engineering, design, operations, sales, and marketing to embed resiliency and observability requirements into every product launch, capacity expansion, and lifecycle transition.

Requirements

BS or MS in Computer Science, Computer Engineering, or a related field (or equivalent experience) and 12+ years of product‑management experience in enterprise technology
Experience with GPU observability (DCGM, NVML, etc.) and integration into large‑scale telemetry systems
Deep knowledge of AI/ML infrastructure, HPC, networking, and cloud technologies (IaaS, PaaS) including containerization, Kubernetes, and automation tools
Familiarity with modern observability stacks: metrics, logs, traces, OpenTelemetry, Prometheus/Grafana, ELK/OpenSearch
Experience building, and preferably a deep understanding of, secure, compliance‑focused telemetry pipelines (SOC2, FedRAMP)
Ability to articulate trade‑offs among latency, throughput, cost, and reliability to both engineering and executive audiences
Data-driven approach: defines SLIs/SLOs, manages error budgets, and develops value models
Strong cross‑functional execution: writes clear specs and PRDs, produces GTM collateral, and leads agile processes
Master's/PhD or expertise in distributed systems, performance modeling, or fault‑tolerant computing
Experience with MLOps and LLMOps ecosystems and integrating with enterprise platforms; deployments at modern data‑center scale; delivered ML/AI observability solutions for LLMOps, predictive incident detection, or anomaly classification
Familiarity with MLOps toolchains and integrations with monitoring platforms such as Splunk, Datadog, and Grafana Cloud
Expertise with containerization technologies like Docker and Kubernetes, plus virtualization
Proficiency in network architecture and high‑performance interconnects (InfiniBand, Ethernet, RoCE)
