Manager, Site Reliability Engineering

Veeam Software

Full-time

Posted on: 1/21/2026

Location Type: Remote

Location: Czech Republic

Job Level

Senior Lead

Tech Stack

Azure, Cloud, Grafana, Kubernetes, Prometheus, Terraform

About the role

Hire, onboard, and grow your SRE team; coach career development and performance
Foster a psychologically safe, blameless culture that favors learning over blame and emphasizes engineering over firefighting
Ensure sustainable operational coverage; monitor on-call health and workload
Track and cap toil so engineers spend the majority of time on project work that reduces future toil
Establish and operationalize SLIs/SLOs and error budgets with service owners; run reliability reviews and hold teams accountable to outcomes
Define reliability standards, runbooks, readiness checklists, and alerting patterns (including SLO-based alerting)
Partner with product and engineering managers to align reliability work with service goals and customer experience, acting as an enabler rather than a gate
Ensure incident response readiness; lead/coordinate major incidents; drive fast, high-quality postmortems and systemic fixes
Measure MTTR, change failure rate, SLO posture, and repeat-incident reduction; publish learnings broadly
Lead software-first reliability investments: observability, deployment safety (canary/blue-green), resilience testing/chaos, and self-service guardrails
Drive platform improvements (IaC, CI/CD, Kubernetes) and internal tools that scale operations and improve developer experience

Requirements

7+ years in Software, Platform, and/or Reliability Engineering with 2+ years managing engineers
Demonstrable experience leading engineering teams to predictably deliver outcomes
Experience leading cross-functional initiatives collaboratively with peers through influence
Experience with public cloud (Azure preferred), Kubernetes, IaC (Terraform, Pulumi), CI/CD (GitHub Actions, Argo CD, Azure DevOps), and observability (OpenTelemetry, Elastic, Datadog, Prometheus, Grafana)
Coding background with experience improving service reliability
Hands-on incident management and postmortem practice; excellent cross-geo communication
Willingness to participate in an on-call rotation (typically during daytime hours, including weekends/holidays)

Benefits

25 vacation days, 4 sick days, 21 paid medical leave days, plus 4 extra global VeeaMe Days for self-care and 24 paid volunteer hours annually through Veeam Cares
Premium private medical insurance for employees and dependents
Daily meal vouchers for restaurants and groceries (180 CZK per working day)
Flexible cafeteria platform with thousands of lifestyle benefit options
Multisport Card for gym and wellness, with family add-on options
Annual public transport reimbursement up to a set limit
Corporate mobile plan with optional family tariff
Opportunities to learn and grow through on-demand libraries (LinkedIn Learning, O’Reilly), mentoring, workshops and learning events like our annual Global Day of Learning

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Software Engineering, Platform Engineering, Reliability Engineering, Incident Management, Service Reliability, Observability, Infrastructure as Code, Continuous Integration, Continuous Deployment, Postmortem Practice

Soft Skills

Team Leadership, Coaching, Collaboration, Influence, Communication, Psychological Safety, Performance Management, Cross-Functional Leadership, Problem Solving, Cultural Development