Reliability Engineer

ShiftCare

full-time

Posted on: 10/13/2025

Location Type: Remote

Location: Remote • 🇦🇺 Australia


Mid-Level, Senior

Distributed Systems

About the role

Own and improve our CI/CD pipelines (CircleCI), reducing deploy times and failure rates.
Design and implement observability tooling, from synthetic checks and smoke tests to meaningful alerts and dashboards.
Build reliable retry and back-off mechanisms for critical user workflows.
Help architect and implement failover and fallback mechanisms for critical vendors and workflows.
Work with Support to build debug tooling and dashboards that empower non-engineers.
Collaborate with engineering to define and template runbooks, kill switches, and disaster mitigation patterns.
Champion performance tuning and scalability improvements.
... and many other things, driven by you!

You thrive on ownership: identifying problems, proposing solutions, and driving them to completion.
You’re passionate about reliability, observability, and building robust distributed systems.
You bring experience working in a modern SaaS environment, have learnt lessons along the way, and are eager to apply that expertise in a new context.
You have deep knowledge of background job processing, eventing, caching, and distributed systems.
You have proven experience improving CI/CD pipelines. We currently use CircleCI but haven't ruled out a migration.
You’re comfortable designing and improving observability stacks (New Relic, Datadog, Honeycomb, etc.).
You’ve built resilient systems using retries, back-offs, queueing, circuit breakers, graceful degradation, kill switches, isolation of workloads, etc.
You care deeply about developer ergonomics and fostering a culture of reliability.
You have a bias toward action, delivering tools that improve both system behavior and developer happiness.
