
Distributed Systems & Reliability Engineer
Glydways
full-time
Location Type: Remote
Location: Remote • 🇺🇸 United States
Job Level
Mid-Level, Senior
Tech Stack
Ansible, Distributed Systems, Go, Kubernetes
About the role
- Own the reliability, availability, and failover behavior of the centralized planning system in production, with a focus on high-availability architectures across servers and clusters.
- Design and implement leader election, health checks, heartbeat protocols, and controlled failover/hand-off when instances fail or become partitioned.
- Define and build state continuity mechanisms so backup instances can take over from recent state (tickets/trips/journeys, vehicle/site state, restrictions) instead of cold-starting.
- Engineer restart-safe, idempotent workflows for trip/ticket handling and routing decisions so replays, retries, and partial failures do not cause double assignment or missing states.
- Extend and refine recovery behaviors, ensuring the system gets to a safe state first and then resumes normal operations in a controlled, observable way.
- Expand and maintain observability: logs, metrics, traces, dashboards, and alerts for key service indicators (latency, backlog, heartbeats, failover time, instance divergence).
- Harden configuration, pipelines, and deployments for the system and related services, including validation of config changes and safe rollout strategies (rolling, blue-green, canary).
- Design and maintain automated test and robustness suites, including scenario-based, stress, fault-injection/chaos, and long-running burn-in tests, and use results to drive hardening work.
- Apply safety-critical, requirements-driven reasoning (including FMEA-style analyses) to functional changes, documenting assumptions and guarantees.
- Collaborate with algorithm developers, Autonomy, Test Ops, and Product to align robustness and failover behavior with algorithmic guarantees, operational procedures, and milestones, and take long-term ownership of production health.
Requirements
- Strong experience building and operating distributed, real-time backend systems (including C++ and Go services).
- Deep understanding of networked, message-driven architectures (TCP/UDP, connection management, backpressure, timeouts, heartbeats, long-lived connections) and of distributed databases with internal or external message queues.
- Proven track record designing and implementing high-availability and failover patterns (leader election, active/standby, hot/warm backups, multi-server or multi-cluster setups, load-balancing).
- Ability to design state replication and recovery mechanisms (snapshots, event logs, shared state stores, distributed key-value, streaming platforms) so services can resume from recent state with minimal disruption.
- Expertise in idempotent, restart-safe operations and APIs that tolerate retries, duplicates, and out-of-order messages without corrupting state or violating safety constraints.
- Strong background in observability and diagnostics: logging, metrics, tracing, SLO definition (latency, backlog, failover time, instance divergence) and debugging production states.
- Experience with configuration-driven systems, deployment automation, and infrastructure as code (Kubernetes, Kustomize/Helm/Ansible or equivalent; rolling/blue-green/canary releases).
- Hands-on experience with automated testing for distributed systems, including integration, scenario-based, stress, fault-injection/chaos, and long-running soak tests.
- Safety-critical mindset and comfort working in a requirements-driven environment, using FMEA-style thinking to reason about failure modes and mitigations.
- Strong ownership and collaboration skills, working closely with developers, ops, and product to improve reliability over time rather than focusing on one-off features or algorithm research.
Benefits
- Equal employment opportunities
- Policy prohibiting discrimination and harassment
Applicant Tracking System Keywords
Hard skills
C++, Go, distributed systems, high-availability architectures, failover patterns, state replication, idempotent operations, observability, automated testing, infrastructure as code
Soft skills
strong ownership, collaboration skills, safety-critical mindset, requirements-driven reasoning