Tech Stack
AWS, Cloud, Google Cloud Platform, Linux, Python
About the role
- Administer and maintain internal HPC clusters using Slurm for workload management
- Collaborate with research teams to support compute-heavy workflows
- Ensure high availability and performance of infrastructure systems
- Monitor, troubleshoot, and resolve system and network issues
- Manage user accounts, system security, and access controls
- Automate routine tasks and improve system processes through scripting
- Maintain and document configurations, procedures, and system changes
- Participate in infrastructure upgrades and scaling initiatives
- Provide expert support for Linux-based systems in a hybrid cloud environment
- Start part-time (20 h/week) for the first 3 months, then ramp up to full-time (40 h/week)
Requirements
- 3+ years of experience as a Systems Administrator or in a similar role
- Strong experience with the Slurm workload manager and cluster administration (Slurm is a must)
- Solid knowledge of Linux system internals, storage, and networking
- Proven experience supporting researchers or scientific computing teams
- Familiarity with configuration management tools and scripting (Bash, Python, etc.)
- Comfortable working independently in a remote, asynchronous environment
- Strong problem-solving and communication skills
- Experience with cloud infrastructure (e.g., AWS/GCP) and hybrid cloud environments
- Experience managing HPC clusters
- English proficiency is mandatory