NVIDIA

Senior Software Engineer, Bare Metal Automation

full-time

Origin: 🇺🇸 United States • California

Salary

💰 $148,000 - $287,500 per year

Job Level

Senior

Tech Stack

Cloud, Distributed Systems, Go, Python

About the role

  • Part of a DGX Cloud team responsible for production systems that enable large, scalable GPU clusters for AI workloads
  • Work on custom software related to managing fleets of GPU nodes
  • Implement monitoring and health management capabilities for reliability, availability, and scalability of GPU assets
  • Harness multiple data streams, from GPU hardware diagnostics to cluster and network telemetry
  • Work with teams across NVIDIA to ensure production AI clusters run reliably and consistently with maximum performance
  • Evaluate system failures and improve services based on a well-defined incident management process

Requirements

  • Direct experience in a software engineering role within a highly technical organization with demonstrable impact
  • Software development experience with bare metal hardware APIs and frameworks, preferably on GPU servers
  • Highly motivated with strong communication skills
  • Ability to work successfully with multi-functional teams, principals, and architects, and to coordinate effectively across organizational boundaries and geographies
  • 5+ years in a similar role, with experience on large-scale production systems
  • Experience with common software engineering principles, tools and techniques
  • BS in Computer Science, Engineering, Physics, Mathematics, or a comparable degree, or equivalent experience
  • Technical knowledge, including a systems programming language (e.g., Go or Python), and a solid understanding of data structures and algorithms