HPC Engineer – Generative Biology

Ellison Institute of Technology Oxford

An HPC Engineer role within the Generative Biology Institute (GBI) at EIT, improving and scaling scientific computing platforms and collaborating with research teams to advance engineering biology solutions.

Posted 5/28/2026 • Full-time • Oxford, 🇬🇧 United Kingdom • Mid-Level, Senior

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills

Linux-based systems, HPC, Slurm, Kubernetes, Terraform, Ansible, Git, CI/CD, GPU-accelerated workloads, containerized HPC

Soft Skills

troubleshooting, communication, documentation, collaboration, proactive approach, learning-oriented, problem-solving, technical explanation, multidisciplinary teamwork, user support

Tools & Technologies

Prometheus, Grafana, Open OnDemand, JupyterLab, OCI, Lustre, BeeGFS, Docker, Nextflow, Snakemake

Industry Keywords

computational biology, data processing, AI/ML, scientific computing, research computing, cloud infrastructure, high-performance storage, security controls, identity access management, batch-computing concepts

Tech Stack

Tools & technologies

Ansible, Cloud, Docker, Grafana, Kubernetes, Linux, NFS, Node.js, Prometheus, Terraform

About the role

Key responsibilities & impact

**Key Responsibilities**
- Operate, maintain, and improve GBI’s hybrid HPC platform, including Linux-based compute environments, Slurm/Slinky workloads, Kubernetes/OKE services, Open OnDemand, GPU and CPU partitions, and shared storage.
- Help provision, configure, scale, and validate compute, storage, networking, and platform services using infrastructure as code, configuration management, and automation tools such as Terraform, Helm, and Ansible.
- Monitor platform health, capacity, job scheduling, GPU utilisation, storage behaviour, and network performance; investigate issues using tools such as Prometheus and Grafana.
- Support researchers in using our Scientific Computing Platform, including triaging user issues and translating common pain points into platform improvements.
- Build and maintain reproducible runtime environments, container images, and workflow-supporting services for scientific computing workloads, including bioinformatics, AI/ML, data processing, and simulation workflows.
- Contribute to safe rollout and maintenance processes for Slurm images, worker node pools, scheduler configuration, container runtime changes, security updates, and monitoring improvements.
- Create and maintain clear technical documentation, runbooks, validation checks, and issue/PR notes so the platform can be operated consistently and improved safely by the wider team.

Requirements

What you’ll need

**Essential Knowledge, Skills and Experience:**
- Bachelor’s or Master’s degree in Computer Science, Computational Biology, Engineering, Physics, Mathematics, or a related discipline, or equivalent practical experience.
- Hands-on experience supporting or administering Linux-based systems in an HPC, cloud, research, academic, or production environment.
- Working knowledge of HPC or batch-computing concepts, including schedulers, resource requests, queues/partitions, shared filesystems, and multi-user compute environments; Slurm experience is preferred.
- Ability to troubleshoot issues across systems, networking, storage, identity, containers, schedulers, and user workloads, and to follow problems through to a reliable operational fix.
- Experience with scripting, automation, and version-controlled operational changes using tools such as Git, CI/CD, Terraform, Ansible, Helm, or similar.
- Ability to work closely with multidisciplinary research teams, understand scientific computing needs, and deliver practical services that advance scientific goals.
- Strong communication and documentation skills, with the ability to explain technical concepts clearly to scientists, engineers, and non-specialist audiences.
- A proactive, learning-oriented approach suited to a new team building and improving a platform while also operating it day to day.

**Desirable Knowledge, Skills and Experience:**
- Experience operating Slurm clusters, Slinky/slurm-operator, Open OnDemand, JupyterLab services, or other researcher-facing HPC portals and access patterns.
- Experience with Kubernetes or managed Kubernetes platforms such as OCI OKE, EKS, GKE, or AKS, including Helm, Argo CD, operators, services, storage classes, and workload troubleshooting.
- Experience with cloud infrastructure, particularly OCI, and with infrastructure as code and remote execution models such as Terraform Cloud.
- Experience with shared and high-performance storage such as Lustre, BeeGFS, GPFS, NFS, OCI File Storage, object storage, or data movement workflows for large scientific datasets.
- Experience supporting GPU-accelerated workloads, NVIDIA tooling, CUDA-aware environments, DCGM metrics, GPU health monitoring, and/or AI/ML and bioinformatics workloads on shared compute platforms.
- Experience with containerised HPC and scientific workflow tooling, such as Apptainer/Singularity, Docker/Podman, Pyxis/Enroot, Nextflow, Snakemake, CWL, or WDL.
- Experience building monitoring and operational dashboards using Prometheus, Grafana, exporter metrics, alerting rules, or capacity and reliability reporting.
- Familiarity with identity, access, and security controls in Linux or research environments, such as OIDC, Okta ASA/PAM, least-privilege access, and security patching.
- Experience working in a scientific, academic, life-science, or research computing environment where requirements evolve through close collaboration with researchers.

Benefits

Comp & perks

**Our Benefits:**
- Salary: Competitive + travel allowance + bonus
- Enhanced holiday pay
- Pension
- Life Assurance
- Income Protection
- Private Medical Insurance
- Hospital Cash Plan
- Therapy Services
- Perk Box
- Electric Car Scheme

**Working Together – What It Involves:**
- You must have the right to work permanently in the UK and be willing to travel as necessary. In certain cases we can consider sponsorship, which is assessed on a case-by-case basis.
- You will live in, or within easy commuting distance of, Oxford (or be willing to relocate).
- Hybrid working