Full-stack Engineer

• Part of a DGX Cloud team responsible for production systems that enable large, scalable GPU clusters for AI workloads
• Work on custom software related to managing fleets of GPU nodes
• Implement monitoring and health management capabilities for reliability, availability, and scalability of GPU assets
• Harness multiple data streams, from GPU hardware diagnostics to cluster and network telemetry
• Work with teams across NVIDIA to ensure production AI clusters run reliably and consistently with maximum performance
• Evaluate system failures and improve services based on a well-defined incident management process

Senior Software Engineer, Bare Metal Automation

Salary

Job Level

Tech Stack

About the role

Requirements

Security Engineer, Detection and Response

Data Scientist III

Platform Engineer, AI/ML Infrastructure

Software Engineer – Computer Vision, Annotation / ML Deployment