NVIDIA

Product Manager, Health Automation, Resilience

full-time

Posted on:

Location Type: Hybrid

Location: Santa Clara • California, Washington • 🇺🇸 United States

Salary

💰 $168,000 - $258,750 per year

Job Level

Senior, Lead

Tech Stack

Cloud, Distributed Systems

About the role

  • Establish the product vision and strategy for Health Automation and Resilience across DGX Cloud and partner GPU fleets.
  • Partner with engineering on the architecture and delivery of software agents, services, control loops, and distributed health components.
  • Convert hardware signals, telemetry pipelines, and operational insights into automation systems that reduce manual intervention.
  • Work with cloud providers and enterprise operators to understand failure modes and operational challenges.
  • Develop product specifications, technical requirements, and validation criteria for both internal and open-source components.
  • Support go-to-market activities including documentation, demos, partner enablement, and release readiness.
  • Track trends in observability, SRE practices, distributed systems, and automated operations to define long-term strategy.
  • Lead product technical reviews, customer conversations, and planning sessions.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a similar area, or equivalent experience.
  • 8+ years of relevant experience, including demonstrated experience leading technical products in cloud infrastructure, distributed systems, reliability engineering, or related fields.
  • Track record of defining multi-quarter strategy and leading execution with multiple engineering teams.
  • Ability to craft clear product requirements, work directly with engineering partners on technical decisions, and compose system-level workflows.
  • Strong architectural understanding of control planes, telemetry systems, health monitoring, repair workflows, or automated remediation systems.
  • Understanding of telemetry signals, SLOs, failure modes, and repair workflows in production environments.
  • Experience building automation, resilience, or failure-recovery capabilities for large-scale cloud or HPC environments.
  • Experience working with open-source technologies or products for software developers.
  • Excellent communication skills across engineering, customers, and executives.

Benefits

  • Equity
  • Benefits

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
cloud infrastructure, distributed systems, reliability engineering, automation systems, telemetry systems, control planes, health monitoring, repair workflows, automated remediation systems, open-source technologies
Soft skills
communication skills, leadership, strategic planning, collaboration, technical decision-making, product vision, customer engagement, documentation, partner enablement, execution
Certifications
Bachelor’s degree in Computer Science, Bachelor’s degree in Engineering