Distributed Systems & Reliability Engineer

Glydways

full-time

Posted on: 12/17/2025

Location Type: Remote

Location: Remote • 🇺🇸 United States

Visit company website

✨ AI Apply

Apply

Job Level

Mid-LevelSenior

Tech Stack

AnsibleDistributed SystemsGoKubernetes

About the role

Own the reliability, availability, and failover behavior of the centralized planning system in production, with a focus on high-availability architectures across servers and clusters.
Design and implement leader election, health checks, heartbeat protocols, and controlled failover/hand-off when instances fail or become partitioned.
Define and build state continuity mechanisms so backup instances can take over from recent state (tickets/trips/journeys, vehicle/site state, restrictions) instead of cold-starting.
Engineer restart-safe, idempotent workflows for trip/ticket handling and routing decisions so replays, retries, and partial failures do not cause double assignment or missing states.
Extend and refine recovery behaviors, ensuring the system gets to a safe state first and then resumes normal operations in a controlled, observable way.
Expand and maintain observability: logs, metrics, traces, dashboards, and alerts for key service indicators (latency, backlog, heartbeats, failover time, instance divergence).
Harden configuration, pipelines, and deployments for the system and related services, including validation of config changes and safe rollout strategies (rolling, blue-green, canary).
Design and maintain automated test and robustness suites, including scenario-based, stress, fault-injection/chaos, and long-running burn-in tests, and use results to drive hardening work.
Apply safety-critical, requirements-driven reasoning (including FMEA-style analyses) to functional changes, documenting assumptions and guarantees.
Collaborate with algorithm developers, Autonomy, Test Ops, and Product to align robustness and failover behavior with algorithmic guarantees, operational procedures, and milestones, and take long-term ownership of production health.

Requirements

Strong experience building and operating distributed, real-time backend systems (including C++ and Go services).
Deep understanding of networked, message-driven architectures (TCP/UDP, connection management, backpressure, timeouts, heartbeats, long-lived connections). Distributed databases with internal or external message queues.
Proven track record designing and implementing high-availability and failover patterns (leader election, active/standby, hot/warm backups, multi-server or multi-cluster setups, load-balancing).
Ability to design state replication and recovery mechanisms (snapshots, event logs, shared state stores, distributed key-value, streaming platforms) so services can resume from recent state with minimal disruption.
Expertise in idempotent, restart-safe operations and APIs that tolerate retries, duplicates, and out-of-order messages without corrupting state or violating safety constraints.
Strong background in observability and diagnostics: logging, metrics, tracing, SLO definition (latency, backlog, failover time, instance divergence) and debugging production states.
Experience with configuration-driven systems, deployment automation, and infrastructure as code (Kubernetes, Kustomize/Helm/Ansible or equivalent; rolling/blue-green/canary releases).
Hands-on experience with automated testing for distributed systems, including integration, scenario-based, stress, fault-injection/chaos, and long-running soak tests.
Safety-critical mindset and comfort working in a requirements-driven environment, using FMEA-style thinking to reason about failure modes and mitigations.
Strong ownership and collaboration skills, working closely with developers, ops, and product to improve reliability over time rather than focusing on one-off features or algorithm research.

Benefits

Equal employment opportunities
Prohibits discrimination and harassment

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills

C++Godistributed systemshigh-availability architecturesfailover patternsstate replicationidempotent operationsobservabilityautomated testinginfrastructure as code

Soft skills

strong ownershipcollaboration skillssafety-critical mindsetrequirements-driven reasoning