NVIDIA

Senior AI Infrastructure Engineer - DGX Cloud

full-time

Origin: 🇺🇸 United States • California, Washington

Salary

💰 $184,000 - $356,500 per year

Job Level

Senior

Tech Stack

Cloud, Distributed Systems, Go, Java, Kubernetes, Linux, Node.js, Open Source, OpenStack, Python, Terraform

About the role

  • Design, build, deploy, and run internal tooling for a large-scale AI training and inference platform built on cloud infrastructure.
  • Conduct in-depth performance characterization and analysis on large multi-GPU and multi-node clusters.
  • Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
  • Support services before they go live through activities such as system design consulting, developing software tools, platforms and frameworks, capacity management and launch reviews.
  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
  • Practice sustainable incident response and blameless postmortems.
  • Be part of an on-call rotation to support production systems.

Requirements

  • BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
  • 6+ years of experience.
  • A track record showing a good balance between initiating your own projects, convincing others to collaborate with you and collaborating well on projects initiated by others.
  • Experience with infrastructure automation and distributed systems design, including developing tools for running large-scale private or public cloud systems in production.
  • Experience in one or more of the following: Python, Go, C/C++, Java.
  • In-depth knowledge of one or more of the following: Linux, networking, storage, and container technologies.
  • Experience with public cloud, Infrastructure as Code (IaC), and Terraform.
  • Distributed systems experience.