Geographically Distributed Clusters

In addition to providing fault tolerance, the nodes in a Vocera cluster can also assist in disaster recovery if you distribute them geographically, because the database is replicated to each node in the cluster.

For example, suppose your deployment has sites in both San Diego and New York City, and you set up two cluster nodes in each of those cities. If the active node is located in San Diego, your deployment would look similar to the following illustration:

Figure 1. Geographically distributed cluster

This deployment enables disaster recovery in a variety of situations. For example, suppose an earthquake causes the WAN link between the two cities to fail, but not the cluster nodes. In this situation, the two nodes in New York form their own cluster and keep Vocera available for that city, while the two nodes in San Diego continue running as a separate cluster and provide Vocera communications for that city, as shown in the following illustration:

Figure 2. Geographically distributed cluster after a WAN failure

When the WAN link goes down, the two servers in New York lose contact with the active node in San Diego and go into discovery mode. One New York node emerges as the active node while the other remains in standby, and those two nodes form their own cluster. The badges in New York temporarily display "searching for server", then find the active New York node.

If the original New York site has its own Vocera telephony Gateway server, that server also connects to the new active node in New York. The New York cluster starts running as an independent Vocera system within seconds. San Diego continues running and is unaffected by the outage, except that it is now also an independent cluster that is no longer connected to New York. Site-to-site calls between the cities are not available until the WAN link is restored and the original cluster is re-established, but both cities continue to have Vocera service.
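The split described above can be modeled with a short sketch. It is an illustrative simulation only, not Vocera's actual discovery protocol: the `Node` class, the `form_clusters` function, the node names, and the rule that the first node in list order becomes active are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    site: str
    role: str = "standby"


def form_clusters(nodes, wan_up):
    """Group nodes into clusters and pick one active node per cluster.

    Assumption for illustration: nodes at different sites can reach each
    other only while the WAN link is up, and the first node in list order
    within each group becomes active. The real election rule is not
    described in this document.
    """
    groups = {}
    for node in nodes:
        # With the WAN up, every node belongs to one cluster; with the WAN
        # down, each site can only see its own nodes.
        key = "all" if wan_up else node.site
        groups.setdefault(key, []).append(node)
    for members in groups.values():
        for i, node in enumerate(members):
            node.role = "active" if i == 0 else "standby"
    return list(groups.values())


nodes = [
    Node("sd-1", "San Diego"),
    Node("sd-2", "San Diego"),
    Node("ny-1", "New York"),
    Node("ny-2", "New York"),
]

# Normal operation: one cluster, one active node.
before = form_clusters(nodes, wan_up=True)
before_summary = [[(n.name, n.role) for n in c] for c in before]

# WAN failure: each site forms its own cluster with its own active node.
after = form_clusters(nodes, wan_up=False)
after_summary = [[(n.name, n.role) for n in c] for c in after]

print(before_summary)
print(after_summary)
```

In this model the WAN failure turns one cluster with a single active node into two clusters that each have an active node and a standby node, which is why both cities keep service but no longer share a database.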

Because the two cities are now running independent clusters, the databases will get out of sync if anyone attempts to perform system maintenance. In addition, voice service logs, messages, and other files will not be replicated between the two clusters. When you restore the connection between the two clusters, these changes will be lost.

In a disaster-recovery scenario, you may need to allow the independent clusters to remain separate for an indefinite period of time, increasing the likelihood that the above files will get out of sync. When the connection between the clusters is restored, these differences will be lost, as described in The Self-Healing Mechanism.

Tip: If you intend to implement a geographically distributed cluster, have some form of change control in place in anticipation of a disaster. In addition, consider disabling the self-healing feature so you can manually rejoin the independent clusters after deciding how to handle any file differences.

The following table lists the system information that gets out of sync when a disaster occurs, and suggests a strategy for managing it:

Table 1. Disaster recovery strategies

What gets lost: Database changes.

Is it preventable? Yes. Implement some form of change control such as one of the following:

  • Send a message to all system and tiered administrators telling them to avoid updating the database. Consider creating a group that revokes all tiered administrator permissions and temporarily adding all the tiered administrator groups to it as members.

  • Record all changes you make to one system so you can update the other system with them after the independent clusters are rejoined.

  • Make all changes to both systems concurrently. This strategy may not be practical after a disaster and may be difficult to manage.

What gets lost: All user recordings (messages, learned names, and so forth).

Is it preventable? Yes. Send a message or broadcast to Everyone, explaining what happened and warning them that their recordings will be lost.

What gets lost: Report logs.

Is it preventable? No. The voice service logs rely on statistics that are recorded during calls. While the systems are running independently, each maintains its own statistics. One of these sets of statistics will be lost when the systems are rejoined.
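The strategy of recording changes made to one system, so they can be applied to the other after the rejoin, can be sketched as a simple append-only log. This is a hypothetical illustration: `ChangeLog`, its methods, the JSON Lines file format, and the sample entries are assumptions and not part of Vocera. Each recorded change still has to be re-entered by hand through the product's own administration tools.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


class ChangeLog:
    """Append-only record of manual database changes (hypothetical helper)."""

    def __init__(self, path):
        self.path = Path(path)

    def record(self, admin, cluster, description):
        """Append one change, noting who made it and on which cluster."""
        entry = {
            "time": datetime.now(timezone.utc).isoformat(),
            "admin": admin,
            "cluster": cluster,
            "description": description,
        }
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")
        return entry

    def pending_for(self, surviving_cluster):
        """Return changes made on any other cluster, in the order recorded.

        Assumption: the caller knows which cluster's database is kept after
        the rejoin; changes made elsewhere are the ones to re-enter.
        """
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            entries = [json.loads(line) for line in f if line.strip()]
        return [e for e in entries if e["cluster"] != surviving_cluster]


# Example with placeholder administrators and changes.
with tempfile.TemporaryDirectory() as tmp:
    log = ChangeLog(Path(tmp) / "changes.jsonl")
    log.record("alice", "New York", "Added a user to a group")
    log.record("bob", "San Diego", "Renamed a group")
    to_reapply = log.pending_for("San Diego")

for entry in to_reapply:
    print(entry["time"], entry["admin"], entry["description"])
```

Keeping the log outside the Vocera database matters here, because anything stored only in the database of the cluster that is overwritten would be lost along with the changes it describes.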