Full-stack Engineer

• Part of a DGX Cloud team responsible for production systems that enable large, scalable GPU clusters to be used for a variety of AI workloads.
• Working on custom software related to managing fleets of GPU nodes.
• Implementing monitoring and health management capabilities that enable industry-leading reliability, availability, and scalability of GPU assets.
• Harnessing multiple data streams, ranging from GPU hardware diagnostics to cluster and network telemetry.
• Working with teams across NVIDIA to ensure production AI clusters run reliably and consistently with maximum performance.
• Evaluating system failures and improving services based on a well-defined incident management process.

Senior Software Engineer, Bare Metal Automation, DGX Cloud
