Distributed Systems Engineer - High-Availability Dispatch
Who we are:
Glydways is reimagining what public transit can be. We believe that mobility is the gateway to opportunity—connecting people to housing, education, employment, commerce, and care. By making transportation more accessible, affordable, and sustainable, we empower communities to thrive and unlock economic and social prosperity.
Our mission is to revolutionize transit with a solution that delivers high capacity, exceptional user experiences, unmatched affordability, and minimal environmental impact.
The Glydways system is a groundbreaking network of carbon-neutral, interconnected transit pathways powered by standardized autonomous vehicles on dedicated roadways. Operating 24/7 with on-demand access, it offers personalized and efficient mobility—without the burden of heavy upfront infrastructure costs or ongoing taxpayer subsidies.
With Glydways, we’re building more than a transportation system; we’re creating a future where everyone, everywhere, has the freedom to move.
About the Role:
Glydways’ Dispatch system is the centralized brain that coordinates our autonomous vehicle fleet. We’re looking for a senior distributed systems / backend engineer to design and implement state sharing between Dispatch instances, hot failover mechanisms, and robustness testing for this safety-critical, real-time service. This is an application-layer role: you will work primarily on the Dispatch codebase (C++), making stateful services correct and resilient under failure. It is not a role focused on generic DevOps, cloud account management, or CI/CD pipelines. You’ll partner closely with autonomy and ops teams to harden Dispatch behavior, design recovery flows, and reduce flaky behavior and unsafe states in production over time.
Candidates whose experience is limited to Kubernetes administration, CI/CD tooling, or cloud configuration without owning stateful application behavior are not a fit for this role.
Responsibilities:
- Design and implement state sharing and replication between multiple Dispatch instances (tickets, journeys, vehicle state, restrictions).
- Build leader election and failover mechanisms (active/standby, hot/warm backup) that guarantee a single authoritative Dispatch instance at any time and a clean handoff on failure.
- Harden Dispatch behavior for restart-safety and idempotency, ensuring retries, replays, and partial failures do not cause double assignment, inconsistent state, or unsafe conditions.
- Design and run stress, load, and fault-injection tests (including chaos experiments) to validate Dispatch behavior under high load, network issues, and process crashes.
- Improve system hardening and recovery flows, defining how Dispatch enters safe modes, recovers from faults, and resumes normal operation in a controlled way.
- Extend and tune observability for Dispatch (logs, metrics, traces, SLOs) so state divergence, failover events, and backlog issues are visible and diagnosable.
- Collaborate with autonomy, product, and ops teams to translate algorithmic and operational requirements into concrete guarantees around state, failover, and robustness.
- Participate in on-call and incident response for Dispatch, lead root-cause analysis for reliability issues, and drive long-term fixes into the application code and architecture.
Knowledge, Skills and Abilities:
- Proven experience designing and shipping stateful distributed services that stay correct under failures.
- Strong programming background in a systems language (C++ strongly preferred) and comfort working at the application layer (routing, tickets, vehicle state, safety envelopes).
- Hands-on experience with leader election / primary–secondary patterns, active/standby or similar, and state replication / recovery (snapshots, event logs, replay, or equivalent).
- Deep understanding of idempotent operations and message semantics (retries, duplicates, out-of-order messages) in networked, message-driven systems (TCP/UDP, gRPC, pub/sub, etc.).
- Experience designing and running stress, load, soak, and fault-injection/chaos tests for distributed systems, and using their results to drive system hardening.
- Strong observability and incident-response skills: defining SLOs, instrumenting metrics/traces, debugging complex failure modes, and leading postmortems for stateful services.
- Safety-critical or mission-critical mindset: familiarity with failure-mode analysis and designing for fail-safe / fail-operational behavior is a plus.
- Experience with cloud platforms is a plus, but this is not a pure DevOps or CI/CD role; candidates must have meaningful ownership of application-level behavior and state.
Glydways provides equal employment opportunities to all employees and applicants for employment and prohibits discrimination and harassment of any type without regard to race, color, religion, age, sex, national origin, disability status, genetics, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state or local laws.