ML Infrastructure Engineer
Nebius Group
The ML Infrastructure Engineer at Nebius leads and supports GPU benchmarking for machine learning and AI workloads, collaborating with hardware and development teams to optimize performance and drive hardware development.
Tech Stack
Tools & technologies
- Docker
- Kubernetes
- PyTorch
About the role
Key responsibilities & impact
- Work closely with hardware and development teams to profile and analyse GPU performance at the system and kernel level.
- Evaluate and compare GPU performance across different platforms, architectures, and software stacks (e.g., CUDA, ROCm).
- Debug and optimise ML workloads to run efficiently on GPU hardware, identifying and resolving performance bottlenecks.
- Perform acceptance testing for new GPU clusters, ensuring hardware and software meet performance, stability, and compatibility requirements for AI workloads.
- Perform experiments across diverse GPU system configurations to assess the impact of varying interconnect strategies and system-level optimisations on performance and scalability.
- Develop tools and dashboards to visualise performance metrics, bottlenecks, and trends.
- Contribute to internal tooling, frameworks, and best practices.
Requirements
What you’ll need
- A profound understanding of the theoretical foundations of machine learning
- Deep understanding of the performance aspects of large neural network training and inference (data/tensor/context/expert parallelism, offloading, custom kernels, hardware features, attention optimisations, dynamic batching, etc.)
- Deep experience with modern deep learning frameworks (PyTorch, JAX, Megatron-LM, TensorRT-LLM)
- Good understanding of the GPU stack: CUDA, NCCL, drivers, and relevant libraries
- Familiarity with containerized environments (e.g., Docker, Kubernetes).
- Strong communication and ability to work independently
Benefits
Comp & perks
- Competitive compensation
- Career growth and learning opportunities
- Flexibility and work-life balance
- Collaborative and innovative culture
- Opportunity to work on impactful AI projects
- International environment and talented teams
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
- GPU performance analysis
- ML workload optimization
- neural networks
- deep learning frameworks
- CUDA
- NCCL
- custom kernels
- performance bottlenecks
- dynamic batching
- interconnect strategies
Soft Skills
- strong communication
- independent work