2026-04-04 · 7 min read

How to design a failover policy that actually works

Failover is the capability that justifies multi-bearer connectivity. A vessel carrying Starlink as primary, VSAT as secondary, and L-band for backup has theoretically purchased resilience — a system that degrades gracefully when any single link fails. The word that matters in that sentence is theoretically. Most multi-bearer installations have a failover capability in the sense that the router has a priority list configured. Far fewer have a failover policy: a documented, tested, understood sequence of what triggers a switch, what happens when it does, and who knows about it.

The difference matters because failover without a policy is not resilience. It is uncertainty with better hardware. When the primary link degrades and the system switches to the backup bearer, the applications running on that connection do not automatically adapt. The crew does not automatically understand what changed. The shore-side team does not automatically receive a notification that the vessel is running on its backup infrastructure. All of those outcomes require decisions that were made before the failover event, documented before it, and tested before the vessel left port.

Trigger conditions are not the same as threshold settings

Most managed routers allow threshold configuration: the primary bearer is considered failed when signal quality drops below a defined level or when packet loss exceeds a defined percentage for a defined duration. These numbers get set during commissioning. In most cases, they are set by the installation engineer based on experience or vendor defaults, not based on analysis of what the vessel's operational applications actually require.

The threshold question has two parts that commissioning defaults typically answer as one. The first is: at what signal quality does the primary bearer fail to support the traffic that matters? Answering it requires knowing what the traffic is: which applications are involved, what latency and packet loss each can tolerate, and what each is doing when the link degrades. The second is: how quickly should failover trigger? A threshold that triggers immediately on a brief signal fluctuation will generate spurious failovers on a vessel transiting weather. A hysteresis window that is too long will allow real degradation to persist for minutes before the backup bearer activates. Both failure modes cause operational disruption. The defaults address neither.

Threshold settings that were never calibrated to operational requirements are not a policy. They are a starting point that was never updated.
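
The two-part threshold question can be made concrete with a small sketch. It assumes the router exposes periodic packet-loss samples for the primary bearer; the class name, the percentages, and the sample counts are all hypothetical placeholders that would need to come from the vessel's own application requirements, not from this example.

```python
from dataclasses import dataclass


@dataclass
class FailoverTrigger:
    """Fail over only after sustained loss; fail back only after sustained health."""
    loss_fail_pct: float = 5.0     # assumed: loss above this is a bad sample
    loss_restore_pct: float = 1.0  # assumed: loss at or below this is a good sample
    fail_after: int = 3            # consecutive bad samples before failing over
    restore_after: int = 6         # consecutive good samples before failing back
    on_backup: bool = False
    _bad: int = 0
    _good: int = 0

    def observe(self, loss_pct: float) -> bool:
        """Feed one packet-loss sample; return True while traffic belongs on the backup."""
        if loss_pct > self.loss_fail_pct:
            self._bad += 1
            self._good = 0
        elif loss_pct <= self.loss_restore_pct:
            self._good += 1
            self._bad = 0
        else:
            # Between the two thresholds: neither streak continues.
            self._bad = 0
            self._good = 0
        if not self.on_backup and self._bad >= self.fail_after:
            self.on_backup = True
        elif self.on_backup and self._good >= self.restore_after:
            self.on_backup = False
        return self.on_backup
```

The separate fail and restore settings are the design point: a single brief spike never reaches `fail_after`, while a slower `restore_after` stops the system from flapping back to a primary that is only momentarily healthy.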

The fallback sequence is a design decision

A vessel with three bearers — primary LEO satellite, secondary GEO VSAT, tertiary L-band — has a fallback sequence whether it has been designed or not. The question is whether the sequence reflects the vessel's operational priorities. L-band is low throughput but highly available and independent of constellation infrastructure. GEO VSAT is higher throughput but shares some vulnerability to the weather events that affect LEO performance on certain routes. The argument for routing L-band as the immediate failover from LEO, before attempting GEO, is that it preserves the most critical low-bandwidth traffic — position reporting, safety communications, engine monitoring — on the most reliable path.

The argument for using GEO VSAT as the first fallback is throughput continuity for operational applications. Neither argument is wrong. The design decision depends on vessel type, route, and the relative operational weight of different traffic categories. What is not acceptable is letting the default priority order (whatever the router assigns based on interface numbering) make that decision unexamined. The fallback sequence is an engineering choice that requires operational input.
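
One way to keep the sequence from being an accident of interface numbering is to write it down with its reasoning attached. The sketch below is illustrative only: the bearer names are hypothetical, and the order shown is just one of the two defensible options discussed above, not a recommendation.

```python
# Hypothetical bearer names; order and reasons are illustrative, not a recommendation.
FALLBACK_SEQUENCE = [
    ("leo", "primary: highest throughput"),
    ("lband", "first fallback: keeps position reporting, safety and engine "
              "monitoring on the most available path"),
    ("geo", "second fallback: restores throughput for operational applications"),
]


def active_bearer(bearers_up):
    """Return the first bearer in the documented sequence that is up, or None."""
    for name, _reason in FALLBACK_SEQUENCE:
        if name in bearers_up:
            return name
    return None
```

A vessel that weights throughput continuity more heavily would swap the second and third entries and record that reason instead.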

Session continuity is a separate problem from bearer continuity

Bearer failover usually changes the vessel's public IP address, because each bearer reaches the internet through a different provider network. An application that holds a persistent connection to a shore-side platform (a remote diagnostic session, a planned maintenance sync, a VPN tunnel) does not continue through that change. The connection drops, and the application must re-establish it. Whether it does this automatically depends on how the application was designed. Whether the crew knows to expect this depends on whether they were told.

Remote monitoring of engine and machinery systems is where this failure mode carries the highest stakes. Many of these platforms depend on a persistent connection that the vessel's system maintains with a shore-side aggregation point. When bearer failover drops that connection, the shore-side platform loses visibility of the vessel's machinery state. If the platform does not reconnect automatically, and many do not, the vessel is dark to shore until someone notices and manually re-establishes the session. The failover worked. The outcome was the same as a total outage.

A failover that drops every persistent session has not maintained operational continuity. It has merely preserved the ability to reconnect.

The configuration that makes failover usable

Session persistence across bearer changes requires either application-level reconnection logic or a network layer that maintains address continuity through the switch. SD-WAN overlays and VPN concentrators configured for fast re-keying are the main technical approaches. Which one is appropriate depends on the applications in use and the network architecture. The point is that session continuity requires explicit design — it does not emerge from installing a multi-bearer router with a priority list.
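
As a minimal sketch of what application-level reconnection logic means in practice, the function below retries a dropped connection with exponential backoff. The `connect` callable stands in for whatever the application uses to open its shore-side session; it and all the default values are assumptions for illustration.

```python
import time


def reconnect_with_backoff(connect, max_attempts=6, base_delay=1.0,
                           max_delay=30.0, sleep=time.sleep):
    """Call connect() until it succeeds, doubling the wait after each failure.

    connect is a hypothetical callable that returns a session object or
    raises OSError (which includes ConnectionError) when the link is down.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except OSError:
            if attempt == max_attempts - 1:
                raise  # give up and surface the error after the last attempt
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

The cap on the delay matters on a backup bearer: without it, a long outage pushes the retry interval so far out that the session stays down well after the link has recovered.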

QoS configuration is the paired requirement. When the vessel is running on its backup L-band link at a fraction of the throughput available on the primary, unmanaged traffic will overwhelm the available bandwidth immediately. Crew welfare traffic — video streaming, social media — will compete with bridge navigation data and engine monitoring. QoS rules that restrict non-critical traffic on backup bearers are not optional. They are the mechanism by which the failover policy becomes operationally viable rather than just technically present.
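
The effect of such rules can be sketched as a strict-priority allocation. This is a simplified model, not a router configuration: the traffic names, priorities, and kbps figures used with it are hypothetical, and real QoS engines offer more nuanced schedulers than strict priority.

```python
def allocate(capacity_kbps, demands):
    """Grant bandwidth in strict priority order until the bearer is full.

    demands: list of (name, priority, kbps); a lower priority number is
    served first, and ties keep their listed order.
    """
    remaining = capacity_kbps
    grants = {}
    for name, _priority, kbps in sorted(demands, key=lambda d: d[1]):
        grant = min(kbps, remaining)
        grants[name] = grant
        remaining -= grant
    return grants
```

With an assumed 400 kbps backup link and demands of 64 kbps for engine monitoring and 128 kbps for navigation (both priority 0), 256 kbps for email (priority 1), and 5000 kbps for streaming (priority 2), the critical flows are served in full, email receives the remaining 208 kbps, and streaming receives nothing.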

Testing is the only way to know the policy works

Failover testing has a bad reputation because it is disruptive. Taking down the primary link on a vessel in service to confirm that the backup activates correctly is not a test that happens on most vessels. The alternative, testing at commissioning before the vessel enters service, is available on every new deployment and is almost never performed systematically.

A commissioning failover test requires three things: a defined test procedure, defined pass criteria, and someone present who can evaluate both. The procedure simulates primary bearer failure — physically disconnecting the antenna feed or using the router's interface disable function — and documents what the system does. The pass criteria define what a successful failover looks like: backup bearer activates within a defined time, traffic continues on critical priority queues, shore-side monitoring receives a notification that the primary link is down, and the system recovers cleanly when the primary link is restored. If any criterion is not met, the commissioning is not complete.
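
The four pass criteria above lend themselves to a recorded, checkable result. The sketch below is one hypothetical way to capture a test run; the field names and the 60-second limit are assumptions, and the real limit should come from what the vessel's critical applications can tolerate.

```python
from dataclasses import dataclass


@dataclass
class FailoverTestResult:
    seconds_to_backup: float          # simulated failure to backup carrying traffic
    critical_traffic_continued: bool  # critical priority queues kept passing traffic
    shore_notified: bool              # shore-side monitoring received a primary-down notice
    recovered_cleanly: bool           # system returned to primary once it was restored


def unmet_criteria(result, max_seconds_to_backup=60.0):
    """Return the pass criteria this test run did not meet (empty list means pass)."""
    unmet = []
    if result.seconds_to_backup > max_seconds_to_backup:
        unmet.append("backup bearer activation time")
    if not result.critical_traffic_continued:
        unmet.append("critical traffic continuity")
    if not result.shore_notified:
        unmet.append("shore-side notification")
    if not result.recovered_cleanly:
        unmet.append("clean recovery to primary")
    return unmet
```

A run where the backup came up in 42 seconds but no notification reached shore returns `["shore-side notification"]`, which is exactly the kind of result that should hold commissioning open.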

This is the test that most deployments skip. The cost of skipping it is discovering the failure in real conditions — during bad weather, during a port authority inspection, during the incident the failover was installed to prevent.

What visibility the policy requires

A failover policy that is working silently is not sufficient. The fleet manager needs to know which bearer any vessel is on at any given time, how long it has been on the backup bearer, and what triggered the switch. Without that visibility, failover events are invisible — the system performs correctly, the shore team has no awareness that it happened, and the monitoring data that would allow analysis of failover frequency and duration is never collected.

Bearer state visibility requires that the network monitoring platform logs failover events with timestamps and trigger conditions. This is a configuration requirement, not an automatic feature of most platforms. It also requires that the shore-side team has defined what they will do when they receive a failover notification: whether they escalate, whether they attempt diagnosis, or whether they log it and wait for the vessel to recover to primary. The response procedure is part of the policy.
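
If failover events are logged with a timestamp, the bearer switched to, and the trigger, the questions a fleet manager asks become simple to answer. The sketch below assumes a hypothetical log format of `(timestamp, bearer, trigger)` tuples; actual monitoring platforms will export something different.

```python
from datetime import datetime, timedelta


def time_on_backup(events, now, primary="leo"):
    """Sum the time spent on any non-primary bearer.

    events: chronological list of (timestamp, bearer, trigger), one per switch.
    """
    total = timedelta()
    for i, (started, bearer, _trigger) in enumerate(events):
        ended = events[i + 1][0] if i + 1 < len(events) else now
        if bearer != primary:
            total += ended - started
    return total


# Hypothetical log for one vessel.
t0 = datetime(2026, 2, 1, 6, 0)
log = [
    (t0, "leo", "startup"),
    (t0 + timedelta(hours=2), "geo", "packet loss above threshold"),
    (t0 + timedelta(hours=2, minutes=30), "leo", "primary restored"),
]
print(time_on_backup(log, now=t0 + timedelta(hours=4)))  # 0:30:00
```

The same log supports failover frequency counts and per-trigger breakdowns, which is the analysis that never happens when events are not recorded.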

What a working failover policy actually contains

A failover policy is a document, not a configuration. It names the bearers in priority order and explains why. It defines the trigger conditions for each failover event and the threshold values used. It describes the expected behaviour of critical applications during and after a failover, and where that behaviour requires specific configuration or application-level changes. It specifies the QoS rules that activate on backup bearers. It includes the test procedure and pass criteria used at commissioning, and the schedule for periodic re-testing. It identifies who receives failover notifications, what they do when they receive them, and when they escalate.

Operators who have written this document — even as a two-page internal procedure — consistently find that the act of writing it exposes gaps in the configuration that were not visible during day-to-day operation. The trigger thresholds turn out to be defaults. The QoS rules turn out not to have been configured. The application reconnection logic turns out not to work as expected. These are findings that are cheap when discovered during a policy review. They are expensive when discovered during a North Atlantic transit in February.

