
Product Manager, Health Automation, Resilience
NVIDIA
full-time
Location Type: Hybrid
Location: Santa Clara • California, Washington • 🇺🇸 United States
Salary
💰 $168,000 - $258,750 per year
Job Level
Senior, Lead
Tech Stack
Cloud, Distributed Systems
About the role
- Establish the product vision and strategy for Health Automation and Resilience across DGX Cloud and partner GPU fleets.
- Partner with engineering on the architecture and delivery of software agents, services, control loops, and distributed health components.
- Convert hardware signals, telemetry pipelines, and operational insights into automation systems that reduce manual intervention.
- Work with cloud providers and enterprise operators to understand failure modes and operational challenges.
- Develop product specifications, technical requirements, and validation criteria for both internal and open-source components.
- Support go-to-market activities including documentation, demos, partner enablement, and release readiness.
- Track trends in observability, SRE practices, distributed systems, and automated operations to define long-term strategy.
- Lead product technical reviews, customer conversations, and planning sessions.
Requirements
- Bachelor’s degree in Computer Science, Engineering, or a similar area, or equivalent experience.
- 8+ years of relevant experience, including demonstrated experience leading technical products in cloud infrastructure, distributed systems, reliability engineering, or related fields.
- Track record defining multi-quarter strategy and leading execution with multiple engineering teams.
- Ability to craft clear product requirements, work directly with engineering partners on technical decisions, and compose system-level workflows.
- Strong architectural understanding of control planes, telemetry systems, health monitoring, repair workflows, or automated remediation systems.
- Understanding of telemetry signals, SLOs, failure modes, and repair workflows in production environments.
- Experience building automation, resilience, or failure-recovery capabilities for large-scale cloud or HPC environments.
- Experience working with open-source technologies or products for software developers.
- Excellent communication skills across engineering, customers, and executives.
Benefits
- Equity
- Benefits
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
cloud infrastructure, distributed systems, reliability engineering, automation systems, telemetry systems, control planes, health monitoring, repair workflows, automated remediation systems, open-source technologies
Soft skills
communication skills, leadership, strategic planning, collaboration, technical decision-making, product vision, customer engagement, documentation, partner enablement, execution
Certifications
Bachelor’s degree in Computer Science, Bachelor’s degree in Engineering