
Senior Site Reliability Engineer, Compute Node Team
Nebius Group
full-time
Location Type: Remote
Location: Netherlands
About the role
- Ensure reliability, availability and performance of compute nodes running VMs
- Analyze and debug Linux systems across user space and kernel space, understanding capabilities, limitations and trade-offs at each layer
- Troubleshoot complex production issues involving CPU, memory, NUMA, cgroups and scheduling
- Work hands-on with virtualization and containerization, primarily using QEMU/KVM and Linux-native technologies
- Design and evolve observability as a core capability of the node layer: metrics, logs, traces, alerts, SLIs and SLOs
- Lead incident response, root-cause analysis, and postmortems, driving long-term reliability improvements
- Collaborate closely with platform, kernel/hypervisor, GPU and infrastructure teams to improve system design and operability
Requirements
- Strong Linux expertise:
  - deep understanding of Linux user space and kernel space
  - knowledge of kernel subsystems (scheduler, memory management, filesystems, cgroups, namespaces)
  - clear understanding of system boundaries and constraints at different layers
- Virtualization experience:
  - hands-on experience with QEMU/KVM
  - understanding of VM lifecycle, performance characteristics and failure modes
- Containerization knowledge:
  - practical experience with containers, namespaces and cgroups
  - strong understanding of resource isolation and control
- Strong debugging skills:
  - ability to reason about complex system failures
  - structured, hypothesis-driven approach to incident analysis
- SRE mindset:
  - clear understanding of the SRE role in system design and operations
  - experience building and operating observability stacks, not just consuming them
  - ability to turn system behavior into actionable reliability signals
Benefits
- Competitive salary and comprehensive benefits package.
- Opportunities for professional growth within Nebius.
- Flexible working arrangements.
- A dynamic and collaborative work environment that values initiative and innovation.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Linux, QEMU, KVM, virtualization, containerization, debugging, observability, metrics, logs, traces
Soft Skills
incident response, root-cause analysis, collaboration, structured approach, hypothesis-driven analysis