A self-healing mechanism automatically rejoins cluster nodes that are in a split brain state.
After self-healing takes effect, the node that has been active for the longest period of time remains active, and any other active nodes rejoin the cluster as standby nodes. The self-healing feature is installed automatically in Vocera 4.0 SP8 and later releases.
To support self-healing, each node keeps track of the length of time that it has been active. Thirty seconds after becoming active, a node notifies all other cluster nodes, whether active or standby, that it is active. At ongoing 30-second intervals, an active node continues to notify the other nodes of the length of time it has been active.
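The notification timing described above can be sketched as follows. This is only an illustration, not Vocera's implementation: the function name and the repeat-interval constant are hypothetical, and treating a negative first-check value as "never notify" is an assumption based on the -1 setting used to disable self-healing.

```python
from itertools import count, islice

# Documented default of ClusterFirstSplitBrainCheckTimeMillis.
FIRST_CHECK_MS = 30_000
# Assumed from the "ongoing 30-second intervals" in the text (hypothetical name).
REPEAT_INTERVAL_MS = 30_000


def notification_times_ms(first_check_ms=FIRST_CHECK_MS,
                          repeat_interval_ms=REPEAT_INTERVAL_MS):
    """Yield the active time (in ms) a node would report at each notification."""
    if first_check_ms < 0:
        # Assumption: a negative value means the node never sends notifications.
        return
    for n in count():
        yield first_check_ms + n * repeat_interval_ms


# First three notifications with the default settings.
print(list(islice(notification_times_ms(), 3)))  # [30000, 60000, 90000]
```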
After the problem that caused the split brain state is resolved, the cluster nodes can communicate again. Each node then compares the length of time it has been active with the length of time other nodes have been active. The node that has been active for the longest period of time remains active; each of the other active nodes enters discovery mode and then comes online again as a standby node. Any badge that was connected to one of these new standby nodes iterates through its cluster list until it connects to the remaining active node.
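The comparison described above can be sketched in a few lines. This is a simplified model under stated assumptions, not the server's actual code: the `Node` class, `resolve_split_brain`, and `connect_badge` are hypothetical names, discovery mode is not modeled, and the documentation does not say how a tie between equal active times is broken (here the first node listed wins).

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    active: bool
    active_seconds: int = 0  # how long the node has been active


def resolve_split_brain(nodes):
    """Keep the longest-active node active; demote every other active node."""
    active = [n for n in nodes if n.active]
    if not active:
        return None
    winner = max(active, key=lambda n: n.active_seconds)
    for n in active:
        if n is not winner:
            # Demoted nodes come back online as standby nodes.
            n.active = False
            n.active_seconds = 0
    return winner


def connect_badge(cluster_list):
    """Walk a badge's cluster list until an active node is found."""
    for node in cluster_list:
        if node.active:
            return node
    return None


vs1 = Node("vs1", active=True, active_seconds=86_400)
vs2 = Node("vs2", active=True, active_seconds=120)
vs3 = Node("vs3", active=False)

winner = resolve_split_brain([vs1, vs2, vs3])
print(winner.name)                         # vs1
print(connect_badge([vs2, vs3, vs1]).name)  # vs1
```

In this example a badge that had been connected to `vs2` finds it demoted to standby and moves through its list until it reaches `vs1`.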
Most split brain states are caused by transient network outages and are short-lived; consequently, the likelihood of independent active nodes getting out of sync is relatively small. The convenience of the self-healing feature typically outweighs the risk of losing changes made to independent active nodes. However, if you intend to use clustering for disaster recovery, you may want to disable the self-healing mechanism and rejoin cluster nodes manually.
The following procedure disables the self-healing mechanism. See Geographically Distributed Clusters for a discussion of disaster recovery. See Manually Rejoining a Split Brain for information about rejoining split brain nodes manually.
To disable the self-healing mechanism:
# ClusterFirstSplitBrainCheckTimeMillis (default=30000)
# Time between becoming active and first check
ClusterFirstSplitBrainCheckTimeMillis = -1