The flexibility of a distributed cluster architecture requires you to have a stable network environment.
Vocera clustering provides a distributed architecture that allows you to locate nodes anywhere on your network, including different subnets (as described in About Vocera Voice Server Clusters) and different geographic locations (as described in Geographically Distributed Clusters). This flexibility is intended in part to support disaster recovery after catastrophic events such as an earthquake or a WAN failure.
In particular, either of the following network problems can cause unwanted cluster behavior:
Network outages
For Vocera purposes, any network event that blocks all routes between the active node and a standby node is an outage. For example, restarting a switch may cause an outage.
Excessive latency
Each standby node polls the active node periodically to pull synchronization transactions. If the active node fails to service a poll from a standby node within 10 seconds, that standby node enters discovery mode, and the cluster may fail over to one of the standby nodes.
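The two problem types above can be told apart with a simple check. The following Python sketch is illustrative only and is not Vocera code: the function name `classify_link` and its parameters are hypothetical, and the only value taken from the text is the 10-second poll threshold.

```python
from typing import Optional, Sequence

# Threshold stated in the text: a poll must be serviced within 10 seconds.
POLL_TIMEOUT_SECONDS = 10.0


def classify_link(routes_up: Sequence[bool],
                  poll_reply_seconds: Optional[float]) -> str:
    """Classify the link between the active node and one standby node.

    routes_up: one flag per known route (hypothetical input); True means
        the route currently passes traffic.
    poll_reply_seconds: how long the active node took to service the last
        poll, or None if no reply arrived at all.
    """
    if not any(routes_up):
        # Every route between the two nodes is blocked.
        return "outage"
    if poll_reply_seconds is None or poll_reply_seconds > POLL_TIMEOUT_SECONDS:
        # A route exists, but the poll was not serviced in time.
        return "excessive latency"
    return "ok"
```

For example, `classify_link([False, False], None)` reports an outage, while `classify_link([True], 12.0)` reports excessive latency. Either result corresponds to a condition that the text says can trigger unwanted cluster behavior.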
Either of the network problems described above may result in the following cluster behavior:
Multiple nodes become active as independent servers that are isolated from each other (a split brain state).
Some badges may connect to one active server; other badges may connect to another active server.
The following illustration shows a simple cluster with an active node and a single standby node:
If the network connection between the nodes is lost, the active node sends an email to indicate that it has lost contact with a standby node. The active node continues to run, and badges that have not lost a network route to it remain connected to it. Badges that cannot find this active node display "Searching for server" and begin to cycle through their list of IP addresses, looking for the active server.
The standby node notices that it has lost contact with the active node, goes into discovery mode, fails to find the active node (because the network connection is down), and comes online as an active node. This new active node sends an email stating that it has become active, and any badges that were "Searching for server" may connect to it.
This situation is known as a split brain because multiple cluster nodes are active, and each node is unaware of other active nodes. This split brain state is shown in the following illustration:
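The sequence described above can be modeled in a few lines. This is a simplified Python sketch under stated assumptions, not the server's actual discovery logic: the `Node` class, the helper names, and the IP addresses are all hypothetical, and the discovery rule is reduced to "become active if no active node is reachable".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Node:
    ip: str
    active: bool


def discover(standby: Node, reachable: List[Node]) -> None:
    """Assumed discovery rule: stay standby if an active node is reachable,
    otherwise come online as an active node."""
    standby.active = not any(n.active for n in reachable)


def find_server(ip_list: List[str], reachable_ips: Set[str],
                nodes: Dict[str, Node]) -> Optional[str]:
    """Cycle through a badge's IP list and return the first reachable
    active node, or None while the badge is still searching."""
    for ip in ip_list:
        if ip in reachable_ips and nodes[ip].active:
            return ip
    return None


def is_split_brain(nodes: Dict[str, Node]) -> bool:
    """More than one active node means the cluster is in a split brain."""
    return sum(1 for n in nodes.values() if n.active) > 1


# Made-up addresses for a two-node cluster: one active, one standby.
nodes = {
    "10.0.0.1": Node("10.0.0.1", active=True),
    "10.0.1.1": Node("10.0.1.1", active=False),
}
ip_list = list(nodes)

# The link between the nodes is lost, so the standby can reach no one.
discover(nodes["10.0.1.1"], reachable=[])

# Badges on each side of the break can reach only their local node.
badge_site_a = find_server(ip_list, {"10.0.0.1"}, nodes)
badge_site_b = find_server(ip_list, {"10.0.1.1"}, nodes)
split = is_split_brain(nodes)
```

After the simulated link loss, both nodes are active, `split` is true, and the two badges end up connected to different servers, which matches the behavior described in the text.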
Similarly, if excessive latency results in the active node failing to service a poll from a standby node within 10 seconds, the standby node enters discovery mode, the active node sends an email message indicating that it has lost contact with a standby, and one of the following situations occurs:
If the latency is transient, the standby node may find the active node and come out of discovery mode as a standby again.
In this situation, the standby rejoins the cluster, and the cluster does not enter a split brain state.
If the latency is great enough, the standby node may be unable to find the active node. The standby node comes out of discovery mode as an active node, and it sends an email indicating that it has come online as an active node.
In this situation, multiple nodes are active, and the cluster is in a split brain state.
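The two outcomes above can be summarized as a small decision function. The Python sketch below is an illustration under assumptions rather than the product's implementation: `Role` and `standby_role_after_poll` are hypothetical names, and the result of discovery is passed in as a plain boolean instead of being computed.

```python
from enum import Enum
from typing import Optional

# Threshold stated in the text: a poll must be serviced within 10 seconds.
POLL_TIMEOUT_SECONDS = 10.0


class Role(Enum):
    STANDBY = "standby"
    ACTIVE = "active"


def standby_role_after_poll(poll_reply_seconds: Optional[float],
                            active_found_in_discovery: bool) -> Role:
    """Return the role a standby node ends up in after one poll.

    poll_reply_seconds: time the active node took to service the poll,
        or None if it never replied.
    active_found_in_discovery: whether discovery mode located the active
        node (only consulted when the poll was not serviced in time).
    """
    if (poll_reply_seconds is not None
            and poll_reply_seconds <= POLL_TIMEOUT_SECONDS):
        # Poll serviced in time: no discovery, the node stays a standby.
        return Role.STANDBY
    # Poll missed: the node enters discovery mode.
    if active_found_in_discovery:
        # Transient latency: the standby rejoins the cluster.
        return Role.STANDBY
    # Active node not found: this node comes online as active,
    # which leaves the cluster in a split brain state.
    return Role.ACTIVE
```

With these assumptions, a 15-second reply followed by a successful discovery keeps the node a standby, while the same delay followed by a failed discovery makes it active.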