
Senior ML Infrastructure Engineer
Ellison Institute of Technology Oxford
full-time
Location Type: Hybrid
Location: Oxford • 🇬🇧 United Kingdom
Job Level
Senior
Tech Stack
Terraform
About the role
- **Day-to-day, you might:**
  - Build, operate, and continuously optimise our high-performance GPU training and inference clusters, focusing on robust, high-availability scheduling, isolation, and automated lifecycle management.
  - Drive systems design and implementation for high-throughput data paths, optimising I/O, caching, and data locality across compute and storage (including our current Lustre implementation).
  - Proactively benchmark, profile, and resolve performance bottlenecks across the compute, network, and orchestration layers to maximise efficiency for distributed training and inference.
  - Establish comprehensive observability, resilience, and automated security controls to ensure compliance and robust operation of sensitive research environments.
  - Partner with Research, Data, and Applied teams to forecast capacity and cost for GPU and storage needs, setting quotas and streamlining ML experimentation pipelines.
Requirements
- **What makes you a great fit:**
  - Proven experience leading the design, build, and operation of high-performance ML compute clusters at scale
  - A proactive, autonomous approach to systems design and the proven ability and desire to ideate, co-create, and implement optimal solutions
  - Exposure to migrating or transforming ML infrastructure from traditional schedulers to modern, containerised systems
  - Expertise with high-throughput storage systems for ML/HPC workloads
  - Expert-level understanding of GPU architecture, high-speed networking for distributed training, and performance profiling to resolve bottlenecks
  - A solid grasp of IaC and CI/CD practices (e.g., Terraform, Argo CD)
- **It would also be great if you had:**
  - Experience with Lustre
Benefits
- **We offer the following salary and benefits:**
  - Enhanced holiday pay
  - Pension
  - Life Assurance
  - Income Protection
  - Private Medical Insurance
  - Hospital Cash Plan
  - Therapy Services
  - Perk Box
  - Electric Car Scheme
- **Why work for EIT:**
  - At the Ellison Institute, we believe a collaborative, inclusive team is key to our success. We are building a supportive environment where creative risks are encouraged and everyone feels heard. We value emotional intelligence, empathy, respect, and resilience, and encourage people to be curious and to have a shared commitment to excellence. Join us and make an impact!
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
GPU architecture, high-performance ML compute clusters, performance profiling, high-throughput storage systems, I/O optimization, caching, data locality, automated lifecycle management, IaC, CI/CD
Soft skills
proactive approach, autonomous, collaboration, problem-solving, ideation, co-creation, implementation