NVIDIA

HPC Operations Engineer

full-time

Location Type: Hybrid

Location: Santa Clara, California, Massachusetts, United States

Salary

💰 $124,000 - $241,500 per year

About the role

  • Provide first-line support for HPC users across scheduling, compute, storage, and access-related issues
  • Troubleshoot job failures, scheduler errors, resource constraints, and performance concerns, driving issues to resolution or appropriate escalation
  • Perform triage of infrastructure incidents, gathering diagnostics and escalating to subject matter experts (SMEs) when issues extend beyond defined ownership
  • Monitor system health, queues, node status, and service availability to ensure stable daily operations
  • Complete established operational procedures for maintenance, patching, and configuration updates
  • Develop and maintain operational documentation, runbooks, knowledge base articles, and guidelines for users and internal teams
  • Contribute to improving team processes by identifying recurring issues and proposing practical workflow refinements

Requirements

  • Bachelor’s degree in Computer Science, Information Technology, Engineering, or related field, or equivalent experience
  • 2+ years of experience supporting Linux-based production environments
  • Solid Linux systems administration fundamentals (RHEL/CentOS and/or Ubuntu)
  • Ability to troubleshoot technical issues methodically and determine when an issue requires escalation
  • Experience interacting directly with users in a technical support or operations role
  • Strong written communication skills with the ability to produce clear documentation and procedural guides
  • Demonstrated ability to follow established processes while maintaining attention to detail
  • Foundational scripting or automation experience (e.g., Bash or Python) sufficient to support routine operational tasks
  • Solid understanding of workload schedulers such as LSF, Slurm, or similar systems
  • Strong grasp of the infrastructure that supports network computing (NFS, automounter, LDAP)
  • Experience supporting HPC or large-scale compute environments
  • Familiarity with EDA workloads

Benefits

  • Comprehensive benefits package
  • Equity opportunities