
Capacity Operations Manager
NVIDIA
full-time
Location Type: Remote
Location: California • United States
Salary
💰 $136,000 - $218,500 per year
About the role
- Coordinate the development of High Performance Computing (HPC) clusters, collaborating closely with internal and external engineering teams.
- Manage and optimize GPU capacity and other compute resources across diverse cloud service platforms to meet growing demand and ensure efficient deployment.
- Design, improve, and manage data models, reporting platforms, data automation solutions, dashboards, and performance metrics that support NVIDIA Infrastructure governance programs and strategic capacity decisions.
- Assess the technical and business requirements for GPU capacity and other compute resources from different internal and external groups.
- Identify performance bottlenecks in day-to-day usage of compute resources and collaborate with relevant infrastructure teams to resolve them.
- Drive infrastructure resource efficiency initiatives in partnership with engineering, finance, and product teams.
- Develop and enhance tooling for our cloud infrastructure and analytics platform to optimize resource usage and performance for NVIDIA and its customers.
- This includes designing and developing tools for automating workflows and, potentially, applying AI techniques to extract useful signals and insights from generated data.
- Partner with Finance, Product, Service Owners, and Infrastructure Engineering teams to align cloud capacity management with company goals and to develop Infrastructure and Service Level benchmarks that support customer satisfaction.
Requirements
- Bachelor's or Master's degree in Computer Science, Software Engineering, or a related field, or equivalent experience.
- 8+ years of overall experience in cloud computing, specifically in managing or using GPU capacity for high performance computing.
- A proven record of large-scale computing operations and planning is a plus.
- Strong technical proficiency in cloud architecture, development and deployment, and managing large data sets.
- Experience with command line interfaces and shell scripting languages.
- Comprehensive knowledge of cloud service models (IaaS, PaaS, SaaS) and cloud infrastructure technologies.
- Practical experience with Cloud Service Providers including AWS, Azure, GCP, and OCI is essential.
- Demonstrated experience in applying AI tools and techniques to extract useful signals and insights from data, specifically to improve resource usage and automation.
- Deep knowledge and active use of statistical modeling and machine learning approaches for boosting operational efficiency and supporting strategic capacity decisions.
- Understanding of analytics, statistical modeling, and machine learning methodologies.
- Strong communication and relationship-building skills, with the ability to work well across different departments and contribute to strategic decisions.
- Self-starter who is self-motivated, focused, and self-sufficient, with a willingness to take on new challenges and adapt quickly in a dynamic environment.
- Ability to operate effectively amidst uncertainty and rapidly changing business conditions, with an agile approach and a commitment to ongoing improvement.
Benefits
- equity
- benefits
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
High Performance Computing (HPC), GPU capacity management, cloud architecture, data modeling, shell scripting, cloud service models, statistical modeling, machine learning, data automation, analytics
Soft Skills
communication, relationship-building, self-starter, self-motivated, adaptability, problem-solving, collaboration, strategic thinking, focus, commitment to improvement
Certifications
Bachelor's degree in Computer Science, Master's degree in Software Engineering