ML Infrastructure Engineer

Nebius Group

ML Infrastructure Engineer at Nebius leading and supporting GPU benchmarking for machine learning and AI workloads. Collaborating with hardware and development teams to optimize performance and drive hardware development.

Posted 5/12/2026 • Full-time • Remote • California • 🇺🇸 United States • Mid-Level, Senior

Tech Stack

Tools & technologies

Docker, Kubernetes, PyTorch

About the role

Key responsibilities & impact

Work closely with hardware and development teams to profile and analyse GPU performance at the system and kernel level.
Evaluate and compare GPU performance across different platforms, architectures, and software stacks (e.g., CUDA, ROCm).
Debug and optimise ML workloads to run efficiently on GPU hardware, identifying and resolving performance bottlenecks.
Perform acceptance testing for new GPU clusters, ensuring hardware and software meet performance, stability, and compatibility requirements for AI workloads.
Perform experiments across diverse GPU system configurations to assess the impact of varying interconnect strategies and system-level optimisations on performance and scalability.
Develop tools and dashboards to visualise performance metrics, bottlenecks, and trends.
Contribute to internal tooling, frameworks, and best practices.

Requirements

What you’ll need

A profound understanding of the theoretical foundations of machine learning
Deep understanding of the performance aspects of large neural network training and inference (data/tensor/context/expert parallelism, offloading, custom kernels, hardware features, attention optimisations, dynamic batching, etc.)
Deep experience with modern deep learning frameworks (PyTorch, JAX, Megatron-LM, TensorRT-LLM)
Good understanding of the GPU stack: CUDA, NCCL, drivers, and relevant libraries
Familiarity with containerized environments (e.g., Docker, Kubernetes)
Strong communication skills and the ability to work independently

Benefits

Comp & perks

Competitive compensation
Career growth and learning opportunities
Flexibility and work-life balance
Collaborative and innovative culture
Opportunity to work on impactful AI projects
International environment and talented teams

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

GPU performance analysis, ML workload optimization, neural networks, deep learning frameworks, CUDA, NCCL, custom kernels, performance bottlenecks, dynamic batching, interconnect strategies

Soft Skills

strong communication, independent work