Tech Stack
AWS, Cloud, Google Cloud Platform, Grafana, ITSM, Kubernetes, ServiceNow, Splunk, Swift
About the role
- Own Incidents End-to-End: Take full ownership of managing and coordinating incidents, from minor issues to critical, platform-wide events, across Yapily’s production environment.
- Command and Coordinate: Mobilise and lead cross-functional technical teams (Engineering, DevOps, SRE, Customer Support) to ensure the swift investigation, mitigation, and resolution of incidents.
- Communicate with Clarity: Drive clear, concise, and timely communication updates to all relevant internal and external stakeholders, including leadership, customer support, and commercial teams.
- Learn and Improve: Lead blameless post-mortem reviews to identify the root cause of incidents. You will be responsible for tracking and driving the implementation of preventative measures to avoid future recurrence.
- Ensure Coverage: Participate in an on-call rotation to provide 24/7 coordination coverage for critical incidents.
- Be Proactive: Utilise our monitoring and observability platforms to proactively identify and address potential issues before they escalate into customer-impacting incidents.
- Maintain Rigorous Standards: Ensure the integrity of our incident management data by meticulously documenting incident timelines, actions taken, and outcomes in our tooling.
- Evolve Our Processes: Act as a key contributor to the evolution of our incident management framework, continuously providing feedback to improve our processes, tooling, and documentation.
Requirements
- Proven Experience: Approximately 2-4 years of experience in a dedicated Incident, Major Incident, or Command Centre role, ideally within a fast-paced SaaS, FinTech, or other regulated technology environment.
- Exceptional Communication: Outstanding verbal and written communication skills, with the ability to articulate complex technical situations clearly and calmly to both technical and non-technical audiences.
- Grace Under Pressure: A calm and resilient demeanour with the proven ability to make logical decisions quickly in high-pressure situations.
- Stakeholder Management: Demonstrable experience in managing expectations and communications with a wide range of stakeholders, from engineers to executive leadership.
- Analytical Mindset: A natural problem-solver with strong analytical skills and meticulous attention to detail.
- Curiosity and Continuous Learning: An eagerness to dig deeper into problems, ask the right questions, and proactively seek opportunities to improve systems, processes, and your own knowledge.
- Familiarity with modern cloud and containerised environments (e.g., AWS, GCP, Kubernetes).
- Hands-on experience with monitoring and observability tools (e.g., Grafana, Datadog, New Relic, Splunk).
- Experience with ITSM and incident response tooling (e.g., Incident.io, PagerDuty, Opsgenie, Jira Service Management, ServiceNow).
- Understanding of Site Reliability Engineering (SRE) principles and methodologies.