Salary
💰 $148,000 - $287,500 per year
About the role
- Part of the DGX Cloud team responsible for production systems that enable large, scalable GPU clusters for AI workloads.
- Work on custom software related to managing fleets of GPU nodes.
- Implement monitoring and health management capabilities to enable reliability, availability, and scalability of GPU assets.
- Harness multiple data streams, from GPU hardware diagnostics to cluster and network telemetry.
- Work with teams across NVIDIA to ensure production AI clusters run reliably and consistently with maximum performance.
- Evaluate system failures and improve services based on the incident management process.
Requirements
- Direct experience in a software engineering role within a highly technical organization with demonstrable impact from your work.
- Software development experience with bare-metal hardware APIs and frameworks, preferably on GPU servers.
- Highly motivated, with strong communication skills; you can work successfully with multi-functional teams, principals, and architects, and coordinate effectively across organizational boundaries and geographies.
- 5+ years in a similar role, with experience on large-scale production systems.
- Experience with common software engineering principles, tools and techniques.
- You possess a BS in Computer Science, Engineering, Physics, Mathematics, or a comparable degree, or equivalent experience.
- Technical knowledge, including a systems programming language (e.g., Go or Python) and a solid understanding of data structures and algorithms.