
Senior Sustaining Engineer
Hyperlayer
full-time
Location Type: Remote
Location: Canada
About the role
- Act as a primary responder in a 24x7 on-call rotation for high-priority incidents, ensuring fast acknowledgment (MTTA targets) and resolution to minimize customer impact in our event-driven fintech platform.
- Conduct root-cause analysis (RCA) for complex issues, collaborating closely with development teams to implement robust solutions and deliver RCAs within 5 business days for Sev1/Sev2 incidents.
- Lead the development and deployment of small, customer-facing features and improvements, ensuring alignment with business needs and system requirements while maintaining a change success rate of ≥99%.
- Work with mid- and junior-level engineers, providing guidance in incident response, troubleshooting best practices, and coding standards within a global rota, including handovers and knowledge sharing via tools like Rootly.
- Take ownership of software maintainability initiatives, identifying and implementing optimizations, and enhancing system performance to achieve availability ≥99.99% (four nines).
- Participate in regular post-incident reviews (blameless retros), documenting lessons learned and suggesting improvements to incident response processes and runbooks for our technology stack.
- Collaborate with the infrastructure team to monitor system health and proactively identify areas for improvement in stability and efficiency using tools like Datadog, Rootly, and CloudWatch/AppDynamics.
Requirements
- Bachelor's degree in Computer Science, Engineering, or a related field.
- 5+ years of experience in sustaining engineering, DevOps, or software engineering with a focus on incident response and system reliability in fintech or regulated environments.
- Advanced troubleshooting skills and experience with Golang (preferred), Java, or similar languages, plus familiarity with event-driven architectures (e.g., NATS/JetStream, Redis clustering).
- Strong familiarity with monitoring and incident response tools (e.g., Datadog, Rootly) and experience implementing improvements in similar systems to meet SLAs like MTTA/MTTR.
- Proven ability to conduct in-depth root-cause analysis and implement long-term fixes in compliance-aware settings (e.g., GDPR/FCA-aligned).
- Experience mentoring or guiding mid-level engineers, with a focus on knowledge sharing and process improvements in geo-distributed teams.
- Awareness of ITIL 4 principles (e.g., incident/change management) and tools like Rootly for unified workflows.
- Strong communication skills and the ability to work collaboratively with both technical and non-technical teams across time zones.
Benefits
- Out‑of‑hours on‑call rotation with additional compensation
- Equity, diversity, and inclusion initiatives
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Golang, Java, event-driven architecture, root-cause analysis, troubleshooting, system reliability, software maintainability, optimizations, incident response, change management
Soft Skills
communication, mentoring, collaboration, guidance, knowledge sharing, leadership, problem-solving, process improvement, teamwork, adaptability