The Self-Healing Mechanism

A self-healing mechanism automatically rejoins cluster nodes that are in a split brain state.

After self-healing takes effect, the node that has been active for the longest period of time remains active, and any other active nodes rejoin the cluster as standby nodes. The self-healing feature is installed automatically in Vocera 4.0 SP8 and later releases.

To support self-healing, each node keeps track of how long it has been active. Thirty seconds after becoming active, a node notifies all other cluster nodes, whether active or standby, that it is active. At 30-second intervals after that, the active node continues to notify the other nodes of how long it has been active.
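The timing described above can be illustrated with a minimal Python sketch. It is not Vocera's implementation: the ActiveClock class, its field names, and the poll method are hypothetical, and only the 30-second interval comes from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Interval stated in the text: first notice 30 s after becoming active,
# then every 30 s while the node stays active.
NOTIFY_INTERVAL_SECONDS = 30.0


@dataclass
class ActiveClock:
    """Hypothetical helper that tracks how long a node has been active."""

    became_active_at: float                   # timestamp (seconds) when the node became active
    last_notified_at: Optional[float] = None  # timestamp of the most recent notification

    def active_seconds(self, now: float) -> float:
        """Return how long the node has been active as of `now`."""
        return now - self.became_active_at

    def poll(self, now: float) -> Optional[float]:
        """Return the active duration to send if a notification is due, else None."""
        reference = (
            self.last_notified_at
            if self.last_notified_at is not None
            else self.became_active_at
        )
        if now - reference >= NOTIFY_INTERVAL_SECONDS:
            self.last_notified_at = now
            return self.active_seconds(now)
        return None
```

In this sketch the caller supplies the current time, so the schedule can be checked without waiting. A node that polls late simply sends its next notification 30 seconds after the late one.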

After the problem that caused the split brain state is resolved, the cluster nodes can communicate again. Each node then compares the length of time it has been active with the length of time other nodes have been active. The node that has been active for the longest period of time remains active; each of the other active nodes enters discovery mode and then comes online again as a standby node. Any badge that was connected to one of these new standby nodes iterates through its cluster list until it connects to the remaining active node.
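The comparison can be sketched in Python as follows. This only illustrates the rule stated above (the longest-active node stays active). The NodeStatus type and the resolve_split_brain function are hypothetical names, and the tie-breaking behavior is an assumption, because the text does not say what happens when two nodes report the same active time.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class NodeStatus:
    """Hypothetical view of one cluster node as reported to its peers."""

    name: str
    is_active: bool
    active_seconds: float  # how long the node has been active; 0 for a standby


def resolve_split_brain(
    nodes: List[NodeStatus],
) -> Tuple[Optional[NodeStatus], List[NodeStatus]]:
    """Return (node that stays active, nodes that rejoin as standby)."""
    active = [n for n in nodes if n.is_active]
    if not active:
        return None, []
    # Assumption: on a tie, the first node in the list wins.
    survivor = max(active, key=lambda n: n.active_seconds)
    demoted = [n for n in active if n is not survivor]
    return survivor, demoted
```

For example, given active nodes that report 120 and 45 seconds of active time, the first remains active and the second is returned in the list of nodes that rejoin as standby.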

Important: While the cluster is in a split brain state, the active nodes have independent databases that will get out of sync if anyone attempts to perform system maintenance. Similarly, Vocera Report Server logs and any user recordings such as messages or learned names get out of sync over time, because they are stored only on the active node to which the badge is attached. When the self-healing mechanism joins a formerly active node to the cluster as a standby, any differences on that formerly active node are lost.

Most split brain states are caused by transient network outages and are short-lived; consequently, the likelihood of independent active nodes getting out of sync is relatively small. The convenience of the self-healing feature typically outweighs the risk of losing changes made to independent active nodes. However, if you intend to use clustering for disaster recovery, you may want to disable the self-healing mechanism and rejoin cluster nodes manually.

The following procedure disables the self-healing mechanism. See Geographically Distributed Clusters for a discussion of disaster recovery. See Manually Rejoining a Split Brain for information about rejoining split brain nodes manually.

To disable the self-healing mechanism:

  1. On each cluster node, navigate to the \vocera\server\ directory and open the properties.txt file in a text editor.
  2. Add the ClusterFirstSplitBrainCheckTimeMillis property and set its value to -1 as follows:
    # ClusterFirstSplitBrainCheckTimeMillis (default=30000)
    # Time between becoming active and first check
    ClusterFirstSplitBrainCheckTimeMillis = -1
  3. Save the properties.txt file.
  4. To load the updated properties.txt file, restart the Vocera Voice Server on each node as follows:
    1. Stop and start the standby node(s). See Stopping and Restarting the Server. The standby node(s) automatically perform a remote restore.
    2. After remote restore is completed on the standby node(s), force a failover on the active node by choosing Cluster > Failover in the Vocera Control Panel.
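If the same edit has to be made on several nodes, step 2 could be scripted. The Python sketch below is one possible approach, not a Vocera tool. The disable_self_healing function is hypothetical, and it assumes that properties.txt uses the name = value syntax shown above, with comment lines starting with #. It works on the file's text only, so reading and writing the file, and restarting the servers as described in step 4, are left to the administrator.

```python
import re

# Property name taken from the procedure above.
PROPERTY = "ClusterFirstSplitBrainCheckTimeMillis"

# Matches an existing, uncommented assignment of the property on one line.
_ASSIGNMENT = re.compile(
    rf"^[ \t]*{re.escape(PROPERTY)}[ \t]*=[^\r\n]*",
    re.MULTILINE,
)


def disable_self_healing(text: str) -> str:
    """Return the properties text with the property set to -1.

    An existing assignment is rewritten in place; otherwise the
    assignment is appended as a new last line.
    """
    new_line = f"{PROPERTY} = -1"
    if _ASSIGNMENT.search(text):
        return _ASSIGNMENT.sub(new_line, text)
    separator = "" if text == "" or text.endswith("\n") else "\n"
    return f"{text}{separator}{new_line}\n"
```

Commented lines that mention the property name, such as the default-value comment in step 2, are left untouched because the pattern only matches lines that begin with the property name.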