Yapily

Incident Analyst

full-time

Origin: 🇬🇧 United Kingdom

Job Level

Junior, Mid-Level

Tech Stack

AWS, Cloud, Google Cloud Platform, Grafana, ITSM, Kubernetes, ServiceNow, Splunk, Swift

About the role

  • Own Incidents End-to-End: Take full ownership of managing and coordinating incidents, from minor issues to critical, platform-wide events, across Yapily’s production environment.
  • Command and Coordinate: Mobilise and lead cross-functional technical teams (Engineering, DevOps, SRE, Customer Support) to ensure the swift investigation, mitigation, and resolution of incidents.
  • Communicate with Clarity: Drive clear, concise, and timely communication updates to all relevant internal and external stakeholders, including leadership, customer support, and commercial teams.
  • Learn and Improve: Lead blameless post-mortem reviews to identify the root cause of incidents. You will track and drive the implementation of preventative measures so that incidents do not recur.
  • Ensure Coverage: Participate in an on-call rotation to provide 24/7 coordination coverage for critical incidents.
  • Be Proactive: Utilise our monitoring and observability platforms to proactively identify and address potential issues before they escalate into customer-impacting incidents.
  • Maintain Rigorous Standards: Ensure the integrity of our incident management data by meticulously documenting incident timelines, actions taken, and outcomes in our tooling.
  • Evolve Our Processes: Act as a key contributor to the evolution of our incident management framework, continuously providing feedback to improve our processes, tooling, and documentation.

Requirements

  • Proven Experience: Approximately 2-4 years of experience in a dedicated Incident, Major Incident, or Command Centre role, ideally within a fast-paced SaaS, FinTech, or other regulated technology environment.
  • Exceptional Communication: Outstanding verbal and written communication skills, with the ability to articulate complex technical situations clearly and calmly to both technical and non-technical audiences.
  • Grace Under Pressure: A calm and resilient demeanour with the proven ability to make logical, firm decisions in high-pressure situations.
  • Stakeholder Management: Demonstrable experience in managing expectations and communications with a wide range of stakeholders, from engineers to executive leadership.
  • Analytical Mindset: A natural problem-solver with strong analytical skills and meticulous attention to detail.
  • Curiosity and Continuous Learning: An eagerness to dig deeper into problems, ask the right questions, and proactively seek opportunities to improve systems, processes, and your own knowledge.
  • Familiarity with modern cloud and containerised environments (e.g., AWS, GCP, Kubernetes).
  • Hands-on experience with monitoring and observability tools (e.g., Grafana, Datadog, New Relic, Splunk).
  • Experience with ITSM and incident response tooling (e.g., Incident.io, PagerDuty, Opsgenie, Jira Service Management, ServiceNow).
  • Understanding of Site Reliability Engineering (SRE) principles and methodologies.