Full-stack Engineer

• Part of the DGX Cloud team responsible for production systems that enable large, scalable GPU clusters for AI workloads.
• Develop custom software for managing fleets of GPU nodes.
• Implement monitoring and health management capabilities to enable reliability, availability, and scalability of GPU assets.
• Harness multiple data streams, from GPU hardware diagnostics to cluster and network telemetry.
• Work with teams across NVIDIA to ensure production AI clusters run reliably and consistently with maximum performance.
• Evaluate system failures and improve services through the incident management process.

Senior Software Engineer, Bare Metal Automation

About the role

Requirements