NVIDIA

Senior HPC and LSF Operations Engineer

full-time

Location Type: Hybrid

Location: Santa Clara, California; Massachusetts; United States

Salary

💰 $152,000 - $241,500 per year

About the role

  • Manage, scale, and optimize job scheduling systems (LSF, Slurm, etc.) in a large-scale, multi-site environment supporting EDA and other compute-intensive workloads
  • Analyze scheduler and infrastructure performance data to identify systemic bottlenecks and drive measurable improvements in utilization, throughput, and turnaround time
  • Lead problem solving across scheduler, OS, and workload layers, ensuring timely resolution of service-impacting issues
  • Identify recurring operational challenges and implement targeted automation or process improvements to reduce manual effort and prevent repeat incidents
  • Help define and track reliable metrics and SLOs for service performance and reliability, partnering with customers to ensure expectations are realistic and measurable
  • Contribute to operational standards, documentation, and best practices to improve consistency across sites
  • Partner directly with customer teams to clarify requirements, translate technical tradeoffs, and drive issues to closure

Requirements

  • Bachelor’s degree in Computer Science or related field, or equivalent experience
  • 5+ years of experience operating and supporting large-scale Linux-based compute infrastructure
  • Strong hands-on experience supporting and tuning job scheduling systems (LSF, Slurm, etc.) in HPC or silicon design environments
  • Proficiency in Linux systems administration (CentOS/RHEL)
  • Strong problem-solving skills and the ability to independently analyze complex system behavior under load
  • Clear and effective communication skills, including the ability to articulate technical tradeoffs and reliability metrics to engineering stakeholders
Benefits
  • Comprehensive benefits package
  • Equity options
Applicant Tracking System Keywords

Hard Skills & Tools
job scheduling systems, LSF, Slurm, Linux systems administration, CentOS, RHEL, performance analysis, automation, process improvements, metrics tracking
Soft Skills
problem solving, communication, analytical skills, collaboration, customer partnership
Certifications
Bachelor’s degree in Computer Science