NVIDIA

Capacity Operations and Analytics Manager

full-time

Origin: 🇺🇸 United States • California

Salary

💰 $200,000 - $322,000 per year

Job Level

Senior, Lead

Tech Stack

AWS, Azure, Cloud, Google Cloud Platform, Grafana, Prometheus, Splunk, Tableau

About the role

  • Manage and optimize GPU capacity and other compute resources across various cloud service providers to meet growing demands and ensure efficient utilization.
  • Build, develop, and maintain data models, reporting systems, data automation systems, dashboards, and performance metrics that support NVIDIA Infrastructure governance programs and strategic capacity decisions.
  • Analyze the technical and business needs for GPU capacity and other compute resources from various internal and external teams.
  • Identify performance bottlenecks in day-to-day usage of compute resources and collaborate with relevant infrastructure teams to resolve them.
  • Drive infrastructure resource efficiency initiatives in partnership with engineering, finance, and product teams.
  • Develop and enhance tooling for our cloud infrastructure and analytics platform to optimize resource usage and performance for NVIDIA and its customers.
  • Partner with Finance, Product, Service Owners, and Infrastructure Engineering teams to align cloud capacity management with company goals, and develop infrastructure and service-level Key Performance Indicators (KPIs) that reflect customer satisfaction.
  • Lead multi-year budget-based compute resource planning with engineering.

Requirements

  • Bachelor's or Master's degree in Computer Science, Software Engineering, or a related field, or equivalent experience.
  • 12+ years of overall experience in cloud computing, specifically in managing or sourcing GPU capacity with cloud service providers.
  • Experience with Cloud Service Providers such as AWS, Azure, GCP, and OCI is required.
  • Strong technical proficiency in cloud architecture, development and deployment, and managing large data sets.
  • Deep understanding of cloud service models (IaaS, PaaS, SaaS) and cloud infrastructure technologies.
  • Demonstrated experience in leveraging AI tools and techniques to extract useful signals and insights from data.
  • Strong understanding and practical application of statistical modeling and machine learning methodologies.
  • Proficiency with data analytics, visualization, and monitoring tools such as Kibana, Grafana, Splunk, Prometheus, Tableau, and Plotly.
  • Excellent communication and interpersonal skills.
  • Ability to operate effectively amidst uncertainty and rapidly changing business conditions.