
Senior ML Infrastructure Engineer – AI
Ellison Institute of Technology Oxford
full-time
Location Type: Hybrid
Location: Oxford • 🇬🇧 United Kingdom
Job Level
Senior
Tech Stack
Terraform
About the role
- Build, operate, and continuously optimise our high-performance GPU training and inference clusters, focusing on robust, high-availability scheduling, isolation, and automated lifecycle management.
- Drive systems design and implementation for high-throughput data paths, optimising I/O, caching, and data locality across compute and storage (including our current Lustre implementation).
- Proactively benchmark, profile, and resolve performance bottlenecks across the compute, network, and orchestration layers to maximise efficiency for distributed training and inference.
- Establish comprehensive observability, resilience, and automated security controls to ensure compliance and robust operation of sensitive research environments.
- Partner with Research, Data, and Applied teams to forecast capacity and cost for GPU and storage needs, setting quotas and streamlining ML experimentation pipelines.
Requirements
- Proven experience leading the design, build, and operation of high-performance ML compute clusters at scale
- A proactive, autonomous approach to systems design, with a proven ability and desire to ideate, co-create, and implement optimal solutions
- Exposure to migrating or transforming ML infrastructure from traditional schedulers to modern, containerised systems
- Expertise with high-throughput storage systems for ML/HPC workloads
- Expert-level understanding of GPU architecture, high-speed networking for distributed training, and performance profiling to resolve bottlenecks
- A solid grasp of IaC and CI/CD practices (e.g., Terraform, Argo CD)
Benefits
- Enhanced holiday pay
- Pension
- Life Assurance
- Income Protection
- Private Medical Insurance
- Hospital Cash Plan
- Therapy Services
- Perk Box
- Electric Car Scheme
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
GPU architecture, high-performance ML compute clusters, performance profiling, high-throughput storage systems, I/O optimization, caching, data locality, IaC, CI/CD practices
Soft skills
proactive approach, autonomous systems design, collaboration, problem-solving