Site Reliability Engineer

NICE

The SRE - NOC role focuses on service reliability, incident response, and operational automation, with an emphasis on reducing operational toil through engineering practices across NICE's global operations.

Posted 4/22/2026 • Full-time • Remote • 🇬🇧 United Kingdom • Mid-Level, Senior

Tech Stack

Tools & technologies

Ansible, AWS, Cloud, DNS, Docker, Go, Grafana, Kubernetes, Linux, Prometheus, Python, Splunk, TCP/IP, Terraform

About the role

Key responsibilities & impact

Act as a primary or escalation responder in a 24x7 on-call rotation
Lead or support Major Incident (MI) response, including triage, mitigation, and resolution
Coordinate across Engineering, Infrastructure, Security, and Product teams
Execute and improve runbooks, playbooks, and escalation paths
Drive blameless post-incident reviews (PIRs) and track corrective actions
Own service health monitoring across infrastructure, applications, and dependencies
Design and maintain alerting strategies that align with SLIs/SLOs
Reduce alert fatigue through signal-to-noise improvements
Build dashboards using tools such as Grafana, Prometheus, Datadog, Splunk, and CloudWatch
Automate repetitive operational tasks to reduce manual toil
Improve mean time to detect (MTTD) and mean time to resolve (MTTR)
Develop scripts and tools (Python, Bash, Go, etc.) to support NOC/SRE workflows
Implement self-healing and auto-remediation where possible
Partner with engineering teams to improve system design for reliability
Support and troubleshoot Linux-based systems, cloud platforms, and Kubernetes/containerized environments
Assist with capacity planning and availability reviews
Ensure operational readiness for production releases

Requirements

What you’ll need

Strong Linux systems administration
Experience with incident management and production support
Familiarity with cloud infrastructure (AWS preferred)
Containers & orchestration (Docker, Kubernetes)
Monitoring/alerting platforms
Scripting or programming experience in Python, Bash, Go, or similar
Understanding of networking fundamentals (DNS, TCP/IP, load balancing)
Experience working in 24x7 NOC or production operations environments
Ability to handle high-pressure incidents calmly and effectively
Strong written and verbal communication for incident coordination
Comfort working from runbooks, and improving them when they fall short
Experience defining or operating to SLOs / SLIs
Prior experience migrating from a traditional NOC to an SRE model
Infrastructure as Code experience (Terraform, Ansible, etc.)
Exposure to security, compliance, or regulated environments

Benefits

Comp & perks

Professional development opportunities
Flexible working hours
Work from home

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Linux systems administration, incident management, cloud infrastructure, containers, orchestration, scripting, programming, networking fundamentals, Infrastructure as Code, monitoring

Soft Skills

calm under pressure, written communication, verbal communication, incident coordination, runbook improvement