
HPC Operations Engineer
NVIDIA
full-time
Posted on:
Location Type: Hybrid
Location: Santa Clara • California • Massachusetts • United States
Visit company websiteExplore more
Salary
💰 $124,000 - $241,500 per year
About the role
- Provide first-line support for HPC users across scheduling, compute, storage, and access-related issues
- Troubleshoot job failures, scheduler errors, resource constraints, and performance concerns, driving issues to resolution or appropriate customer concern
- Perform triage of infrastructure incidents, gathering diagnostics and advancing to subject matter experts (SMEs) when issues extend beyond defined ownership
- Monitor system health, queues, node status, and service availability to ensure stable daily operations
- Complete established operational procedures for maintenance, patching, and configuration updates
- Develop and maintain operational documentation, runbooks, knowledge base articles, and guidelines for users and internal teams
- Contribute to improving team processes by identifying recurring issues and proposing practical workflow refinements
Requirements
- Bachelor’s degree in Computer Science, Information Technology, Engineering, or related field, or equivalent experience
- 2+ years of experience supporting Linux-based production environments
- Solid Linux systems administration fundamentals (RHEL/CentOS and/or Ubuntu)
- Ability to troubleshoot technical issues methodically and determine when a customer concern requires attention
- Experience interacting directly with users in a technical support or operations role
- Strong written communication skills with the ability to produce clear documentation and procedural guides
- Demonstrated ability to follow established processes while maintaining attention to detail
- Foundational scripting or automation experience (e.g., Bash or Python) sufficient to support routine operational tasks
- Solid understanding of workload schedulers such as LSF, Slurm, or similar systems
- Strong grasp of network computing supporting infrastructure (NFS, automounter, LDAP)
- Experience supporting HPC or large-scale compute environments
- Familiarity with EDA workloads
Benefits
- comprehensive benefits package
- equity opportunities
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Linux systems administrationRHELCentOSUbuntuscriptingBashPythonworkload schedulersLSFSlurm
Soft Skills
troubleshootingwritten communicationattention to detailprocess adherenceuser interactionproblem-solvingworkflow refinementdocumentationcustomer serviceteam collaboration
Certifications
Bachelor’s degree in Computer ScienceBachelor’s degree in Information TechnologyBachelor’s degree in Engineering