Ellison Institute of Technology Oxford

Senior ML Infrastructure Engineer – AI

full-time

Location Type: Hybrid

Location: Oxford • 🇬🇧 United Kingdom

Job Level

Senior

Tech Stack

Terraform

About the role

  • Build, operate, and continuously optimise our high-performance GPU training and inference clusters, focusing on robust, high-availability scheduling, isolation, and automated lifecycle management.
  • Drive systems design and implementation for high-throughput data paths, optimising I/O, caching, and data locality across compute and storage (including our current Lustre implementation).
  • Proactively benchmark, profile, and resolve performance bottlenecks across the compute, network, and orchestration layers to maximise efficiency for distributed training and inference.
  • Establish comprehensive observability, resilience, and automated security controls to ensure compliance and robust operation of sensitive research environments.
  • Partner with Research, Data, and Applied teams to forecast capacity and cost for GPU and storage needs, setting quotas and streamlining ML experimentation pipelines.

Requirements

  • Proven experience leading the design, build, and operation of high-performance ML compute clusters at scale
  • A proactive, autonomous approach to systems design, with the proven ability and desire to ideate, co-create, and implement optimal solutions
  • Exposure to migrating or transforming ML infrastructure from traditional schedulers to modern, containerised systems
  • Expertise with high-throughput storage systems for ML/HPC workloads
  • Expert-level understanding of GPU architecture, high-speed networking for distributed training, and performance profiling to resolve bottlenecks
  • A solid grasp of IaC and CI/CD practices (e.g., Terraform, Argo CD)

Benefits

  • Enhanced holiday pay
  • Pension
  • Life Assurance
  • Income Protection
  • Private Medical Insurance
  • Hospital Cash Plan
  • Therapy Services
  • Perk Box
  • Electric Car Scheme

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
  • GPU architecture
  • high-performance ML compute clusters
  • performance profiling
  • high-throughput storage systems
  • I/O optimization
  • caching
  • data locality
  • IaC
  • CI/CD practices
Soft skills
  • proactive approach
  • autonomous systems design
  • collaboration
  • problem-solving