Salary
💰 $148,000 - $287,500 per year
About the role
- Part of a DGX Cloud team responsible for production systems that enable large, scalable GPU clusters to be used for a variety of AI workloads.
- Working on custom software related to managing fleets of GPU nodes.
- Implementing monitoring and health management capabilities that enable industry-leading reliability, availability, and scalability of GPU assets.
- Harnessing multiple data streams, ranging from GPU hardware diagnostics to cluster and network telemetry.
- Working with teams across NVIDIA to ensure production AI clusters run reliably and consistently with maximum performance.
- Evaluating system failures and improving services based on a well-defined incident management process.
Requirements
- Direct experience in a software engineering role within a highly technical organization with demonstrable impact from your work.
- Software development experience with bare-metal hardware APIs and frameworks, preferably on GPU servers.
- Highly motivated, with strong communication skills; you can work successfully with multi-functional teams, principals, and architects, and coordinate effectively across organizational boundaries and geographies.
- 5+ years in a similar role, with experience on large-scale production systems.
- Experience with common software engineering principles, tools, and techniques.
- You possess a BS in Computer Science, Engineering, Physics, Mathematics, or a comparable degree, or equivalent experience.
- Technical knowledge, including a systems programming language (Go, Python) and a solid understanding of data structures and algorithms.