Ellison Institute of Technology Oxford

Senior ML Infrastructure Engineer

full-time

Location Type: Hybrid

Location: Oxford • 🇬🇧 United Kingdom

Job Level

Senior

Tech Stack

Terraform

About the role

**Day-to-day, you might:**
  • Build, operate, and continuously optimise our high-performance GPU training and inference clusters, focusing on robust, high-availability scheduling, isolation, and automated lifecycle management.
  • Drive systems design and implementation for high-throughput data paths, optimising I/O, caching, and data locality across compute and storage (including our current Lustre implementation).
  • Proactively benchmark, profile, and resolve performance bottlenecks across the compute, network, and orchestration layers to maximise efficiency for distributed training and inference.
  • Establish comprehensive observability, resilience, and automated security controls to ensure compliance and robust operation of sensitive research environments.
  • Partner with Research, Data, and Applied teams to forecast capacity and cost for GPU and storage needs, setting quotas and streamlining ML experimentation pipelines.

Requirements

**What makes you a great fit:**
  • Proven experience leading the design, build, and operation of high-performance ML compute clusters at scale
  • A proactive, autonomous approach to systems design and the proven ability and desire to ideate, co-create and implement optimal solutions
  • Exposure to migrating or transforming ML infrastructure from traditional schedulers to modern, containerised systems
  • Expertise with high-throughput storage systems for ML/HPC workloads
  • Expert-level understanding of GPU architecture, high-speed networking for distributed training, and performance profiling to resolve bottlenecks
  • A solid grasp of IaC and CI/CD practices (e.g., Terraform, Argo CD)

**It would also be great if you had:**
  • Experience with Lustre

Benefits

**We offer the following salary and benefits:**
  • Enhanced holiday pay
  • Pension
  • Life Assurance
  • Income Protection
  • Private Medical Insurance
  • Hospital Cash Plan
  • Therapy Services
  • Perkbox
  • Electric Car Scheme

**Why work for EIT:**
At the Ellison Institute, we believe a collaborative, inclusive team is key to our success. We are building a supportive environment where creative risks are encouraged, and everyone feels heard. We value emotional intelligence, empathy, respect, and resilience, and encourage people to be curious and to have a shared commitment to excellence. Join us and make an impact!
