Nebius Group

ML Infrastructure Engineer

Nebius is hiring an ML Infrastructure Engineer to lead and support GPU benchmarking for machine learning and AI workloads. The role involves collaborating with hardware and development teams to optimize performance and drive hardware development.

Posted 5/12/2026 • Full-time • Remote • California • 🇺🇸 United States • Mid-Level, Senior

Tech Stack

Tools & technologies
Docker, Kubernetes, PyTorch

About the role

Key responsibilities & impact
  • Work closely with hardware and development teams to profile and analyse GPU performance at the system and kernel level.
  • Evaluate and compare GPU performance across different platforms, architectures, and software stacks (e.g., CUDA, ROCm).
  • Debug and optimise ML workloads to run efficiently on GPU hardware, identifying and resolving performance bottlenecks.
  • Perform acceptance testing for new GPU clusters, ensuring hardware and software meet performance, stability, and compatibility requirements for AI workloads.
  • Perform experiments across diverse GPU system configurations to assess the impact of varying interconnect strategies and system-level optimisations on performance and scalability.
  • Develop tools and dashboards to visualise performance metrics, bottlenecks, and trends.
  • Contribute to internal tooling, frameworks, and best practices.

Requirements

What you’ll need
  • A profound understanding of the theoretical foundations of machine learning.
  • Deep understanding of the performance aspects of large neural network training and inference (data/tensor/context/expert parallelism, offloading, custom kernels, hardware features, attention optimisations, dynamic batching, etc.).
  • Deep experience with modern deep learning frameworks (PyTorch, JAX, Megatron-LM, TensorRT-LLM).
  • Good understanding of the GPU stack: CUDA, NCCL, drivers, and relevant libraries.
  • Familiarity with containerized environments (e.g., Docker, Kubernetes).
  • Strong communication skills and the ability to work independently.

Benefits

Comp & perks
  • Competitive compensation
  • Career growth and learning opportunities
  • Flexibility and work-life balance
  • Collaborative and innovative culture
  • Opportunity to work on impactful AI projects
  • International environment and talented teams
