Glydways

Distributed Systems & Reliability Engineer

Glydways

full-time

Posted on:

Location Type: Remote

Location: Remote • 🇺🇸 United States

Visit company website
AI Apply
Apply

Job Level

Mid-LevelSenior

Tech Stack

AnsibleDistributed SystemsGoKubernetes

About the role

  • Own the reliability, availability, and failover behavior of the centralized planning system in production, with a focus on high-availability architectures across servers and clusters.
  • Design and implement leader election, health checks, heartbeat protocols, and controlled failover/hand-off when instances fail or become partitioned.
  • Define and build state continuity mechanisms so backup instances can take over from recent state (tickets/trips/journeys, vehicle/site state, restrictions) instead of cold-starting.
  • Engineer restart-safe, idempotent workflows for trip/ticket handling and routing decisions so replays, retries, and partial failures do not cause double assignment or missing states.
  • Extend and refine recovery behaviors, ensuring the system gets to a safe state first and then resumes normal operations in a controlled, observable way.
  • Expand and maintain observability: logs, metrics, traces, dashboards, and alerts for key service indicators (latency, backlog, heartbeats, failover time, instance divergence).
  • Harden configuration, pipelines, and deployments for the system and related services, including validation of config changes and safe rollout strategies (rolling, blue-green, canary).
  • Design and maintain automated test and robustness suites, including scenario-based, stress, fault-injection/chaos, and long-running burn-in tests, and use results to drive hardening work.
  • Apply safety-critical, requirements-driven reasoning (including FMEA-style analyses) to functional changes, documenting assumptions and guarantees.
  • Collaborate with algorithm developers, Autonomy, Test Ops, and Product to align robustness and failover behavior with algorithmic guarantees, operational procedures, and milestones, and take long-term ownership of production health.

Requirements

  • Strong experience building and operating distributed, real-time backend systems (including C++ and Go services).
  • Deep understanding of networked, message-driven architectures (TCP/UDP, connection management, backpressure, timeouts, heartbeats, long-lived connections). Distributed databases with internal or external message queues.
  • Proven track record designing and implementing high-availability and failover patterns (leader election, active/standby, hot/warm backups, multi-server or multi-cluster setups, load-balancing).
  • Ability to design state replication and recovery mechanisms (snapshots, event logs, shared state stores, distributed key-value, streaming platforms) so services can resume from recent state with minimal disruption.
  • Expertise in idempotent, restart-safe operations and APIs that tolerate retries, duplicates, and out-of-order messages without corrupting state or violating safety constraints.
  • Strong background in observability and diagnostics: logging, metrics, tracing, SLO definition (latency, backlog, failover time, instance divergence) and debugging production states.
  • Experience with configuration-driven systems, deployment automation, and infrastructure as code (Kubernetes, Kustomize/Helm/Ansible or equivalent; rolling/blue-green/canary releases).
  • Hands-on experience with automated testing for distributed systems, including integration, scenario-based, stress, fault-injection/chaos, and long-running soak tests.
  • Safety-critical mindset and comfort working in a requirements-driven environment, using FMEA-style thinking to reason about failure modes and mitigations.
  • Strong ownership and collaboration skills, working closely with developers, ops, and product to improve reliability over time rather than focusing on one-off features or algorithm research.
Benefits
  • Equal employment opportunities
  • Prohibits discrimination and harassment

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
C++Godistributed systemshigh-availability architecturesfailover patternsstate replicationidempotent operationsobservabilityautomated testinginfrastructure as code
Soft skills
strong ownershipcollaboration skillssafety-critical mindsetrequirements-driven reasoning