Salary
💰 $148,000 - $287,500 per year
Tech Stack
Cloud, Distributed Systems, Go, Python
About the role
- Join a DGX Cloud team responsible for production systems that enable large, scalable GPU clusters for AI workloads
- Work on custom software related to managing fleets of GPU nodes
- Implement monitoring and health management capabilities for reliability, availability, and scalability of GPU assets
- Harness multiple data streams, from GPU hardware diagnostics to cluster and network telemetry
- Work with teams across NVIDIA to ensure production AI clusters run reliably and consistently with maximum performance
- Evaluate system failures and improve services based on a well-defined incident management process
Requirements
- Direct experience in a software engineering role within a highly technical organization with demonstrable impact
- Software development experience with bare-metal hardware APIs and frameworks, preferably on GPU servers
- Highly motivated with strong communication skills
- Ability to work successfully with multi-functional teams, principals, and architects, and to coordinate effectively across organizational boundaries and geographies
- 5+ years in a similar role, with experience on large-scale production systems
- Experience with common software engineering principles, tools and techniques
- BS in Computer Science, Engineering, Physics, Mathematics, or a comparable degree, or equivalent experience
- Technical knowledge, including a systems programming language (Go, Python) and a solid understanding of data structures and algorithms