NICE

Site Reliability Engineer

The SRE - NOC role focuses on service reliability, incident response, and operational automation, reducing operational toil through engineering practices for global operations at NICE.

Posted 4/22/2026 • Full-time • Remote • 🇬🇧 United Kingdom • Mid-Level, Senior

Tech Stack

Tools & technologies
Ansible, AWS, Cloud, DNS, Docker, Go, Grafana, Kubernetes, Linux, Prometheus, Python, Splunk, TCP/IP, Terraform

About the role

Key responsibilities & impact
  • Act as a primary or escalation responder in a 24x7 on-call rotation
  • Lead or support Major Incident (MI) response, including triage, mitigation, and resolution
  • Coordinate across Engineering, Infrastructure, Security, and Product teams
  • Execute and improve runbooks, playbooks, and escalation paths
  • Drive blameless post-incident reviews (PIRs) and track corrective actions
  • Own service health monitoring across infrastructure, applications, and dependencies
  • Design and maintain alerting strategies that align with SLIs/SLOs
  • Reduce alert fatigue through signal-to-noise improvements
  • Build dashboards using tools such as Grafana, Prometheus, Datadog, Splunk, CloudWatch
  • Automate repetitive operational tasks to reduce manual toil
  • Improve mean time to detect (MTTD) and mean time to resolve (MTTR)
  • Develop scripts and tools (Python, Bash, Go, etc.) to support NOC/SRE workflows
  • Implement self-healing and auto-remediation where possible
  • Partner with engineering teams to improve system design for reliability
  • Support and troubleshoot Linux-based systems, cloud platforms, Kubernetes/containerized environments
  • Assist with capacity planning and availability reviews
  • Ensure operational readiness for production releases

Requirements

What you’ll need
  • Strong Linux systems administration
  • Experience with incident management and production support
  • Familiarity with cloud infrastructure (AWS preferred)
  • Containers & orchestration (Docker, Kubernetes)
  • Monitoring/alerting platforms
  • Scripting or programming experience in Python, Bash, Go, or similar
  • Understanding of networking fundamentals (DNS, TCP/IP, load balancing)
  • Experience working in 24x7 NOC or production operations environments
  • Ability to handle high-pressure incidents calmly and effectively
  • Strong written and verbal communication for incident coordination
  • Comfort working from runbooks—but improving them when they fall short
  • Experience defining or operating to SLOs / SLIs
  • Prior experience migrating from a traditional NOC to an SRE model
  • Infrastructure as Code experience (Terraform, Ansible, etc.)
  • Exposure to security, compliance, or regulated environments

Benefits

Comp & perks
  • Professional development opportunities
  • Flexible working hours
  • Work from home

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Linux systems administration, incident management, cloud infrastructure, containers, orchestration, scripting, programming, networking fundamentals, Infrastructure as Code, monitoring
Soft Skills
calm under pressure, written communication, verbal communication, incident coordination, runbook improvement