
Senior HPC and LSF Operations Engineer
NVIDIA
full-time
Location Type: Hybrid
Location: Santa Clara • California • Massachusetts • United States
Salary
💰 $152,000 - $241,500 per year
About the role
- Manage, scale, and optimize job scheduling systems (LSF, Slurm, etc.) in a large-scale, multi-site environment supporting EDA and other compute-intensive workloads
- Analyze scheduler and infrastructure performance data to identify systemic bottlenecks and drive measurable improvements in utilization, throughput, and turnaround time
- Lead problem solving across scheduler, OS, and workload layers, ensuring timely resolution of service-impacting issues
- Identify recurring operational challenges and implement targeted automation or process improvements to reduce manual effort and prevent repeat incidents
- Help define and track reliable metrics and SLOs for service performance and reliability, partnering with customers to ensure expectations are realistic and measurable
- Contribute to operational standards, documentation, and best practices to improve consistency across sites
- Partner directly with customer teams to clarify requirements, translate technical tradeoffs, and drive issues to closure
Requirements
- Bachelor’s degree in Computer Science or related field, or equivalent experience
- 5+ years of experience operating and supporting large-scale Linux-based compute infrastructure
- Strong hands-on experience supporting and tuning job scheduling systems (LSF, Slurm, etc.) in HPC or silicon design environments
- Proficiency in Linux systems administration (CentOS/RHEL)
- Strong problem-solving skills and the ability to independently analyze complex system behavior under load
- Clear and effective communication skills, including the ability to articulate technical tradeoffs and reliability metrics to engineering stakeholders
Benefits
- Comprehensive benefits package
- Equity options
Applicant Tracking System Keywords
Hard Skills & Tools
job scheduling systems, LSF, Slurm, Linux systems administration, CentOS, RHEL, performance analysis, automation, process improvements, metrics tracking
Soft Skills
problem solving, communication, analytical skills, collaboration, customer partnership
Certifications
Bachelor’s degree in Computer Science