Administer and maintain Linux-based HPC clusters, including compute nodes, head nodes, and storage systems.
Monitor system health, performance, and resource utilization to ensure high availability and efficiency.
Manage job schedulers and resource managers (e.g., Slurm, PBS, or Torque).
Configure and maintain high-speed storage and parallel file systems (e.g., Lustre, BeeGFS, or GPFS).
Ensure cluster security and compliance, including user access management, software patching, and vulnerability monitoring.
Install, update, and optimize scientific software modules and libraries (e.g., via Spack, EasyBuild, or environment modules).
Develop automation scripts (Bash, Python) to streamline administrative tasks.
Perform backup and disaster recovery planning for HPC systems and research data.
Collaborate with researchers to troubleshoot complex computing workflows and improve job throughput.
Document HPC procedures, best practices, and system changes for team and user reference.
Support enrollment growth, student retention, and campus safety by maintaining reliable computational resources.
Requirements
Bachelor’s degree in Computer Science, Engineering, or a related field.
A minimum of three (3) years of Linux system administration experience in a multi-user environment.
A minimum of three (3) years of hands-on experience with a combination of HPC clusters, job schedulers, and parallel computing.
Familiarity with parallel file systems, storage management, and networked environments (InfiniBand, Ethernet).
Experience with system monitoring and performance tuning in Linux environments.
Proficiency in scripting with Bash or Python for automation and system management.
Any equivalent combination of related education and/or experience will be considered.
All qualifications must be met by the time of employment.
Strong troubleshooting and documentation skills, with the ability to collaborate effectively with researchers.
Experience with GPU-enabled nodes and CUDA or ROCm environments.
Familiarity with HPC software stacks, scientific libraries, and containerized workflows (e.g., Singularity/Apptainer).
Knowledge of data security requirements for research, such as HIPAA, FISMA, or CUI.
Prior experience supporting HPC environments in academic or research settings (preferred).