Nebius Group

Senior Site Reliability Engineer, Compute Node Team

full-time

Location Type: Remote

Location: Netherlands

About the role

  • Ensure reliability, availability and performance of compute nodes running VMs
  • Analyze and debug Linux systems across user space and kernel space, understanding capabilities, limitations and trade-offs at each layer
  • Troubleshoot complex production issues involving CPU, memory, NUMA, cgroups and scheduling
  • Work hands-on with virtualization and containerization, primarily using QEMU/KVM and Linux-native technologies
  • Design and evolve observability as a core capability of the node layer: metrics, logs, traces, alerts, SLIs and SLOs
  • Lead incident response, root-cause analysis, and postmortems, driving long-term reliability improvements
  • Collaborate closely with platform, kernel/hypervisor, GPU and infrastructure teams to improve system design and operability

Requirements

  • Strong Linux expertise:
      • deep understanding of Linux user space and kernel space
      • knowledge of kernel subsystems (scheduler, memory management, filesystems, cgroups, namespaces)
      • clear understanding of system boundaries and constraints at different layers
  • Virtualization experience:
      • hands-on experience with QEMU/KVM
      • understanding of VM lifecycle, performance characteristics and failure modes
  • Containerization knowledge:
      • practical experience with containers, namespaces and cgroups
      • strong understanding of resource isolation and control
  • Strong debugging skills:
      • ability to reason about complex system failures
      • structured, hypothesis-driven approach to incident analysis
  • SRE mindset:
      • clear understanding of the SRE role in system design and operations
      • experience building and operating observability stacks, not just consuming them
      • ability to turn system behavior into actionable reliability signals

Benefits

  • Competitive salary and comprehensive benefits package.
  • Opportunities for professional growth within Nebius.
  • Flexible working arrangements.
  • A dynamic and collaborative work environment that values initiative and innovation.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Linux, QEMU, KVM, virtualization, containerization, debugging, observability, metrics, logs, traces
Soft Skills
incident response, root-cause analysis, collaboration, structured approach, hypothesis-driven analysis