Recently a customer asked me to set up replication between two distinct Percona XtraDB Clusters located in geographically separate data centers. The customer's goal was to use one of the clusters only for disaster recovery. They tried extending the cluster, but because of the WAN latency impact on their writes and the requirements of a node in a third data center for
Easy enough, right? Both clusters are cloud-based, of course, and the provider can stop or restart any node on short notice. This raised some concern for the customer about how to handle replication. Since they don’t have dedicated personnel to monitor replication, or even to handle alerts, they asked if we could find a way to automate the process. So, here we go!
We all try to solve problems with the tools we know. In my case, I like Pacemaker a lot, so using Pacemaker was my first thought. In a cloud environment, though, a Pacemaker setup is not easy (wouldn’t that be a cluster in a cluster… a bit heavy). But wait! Percona XtraDB Cluster with Galera replication already handles quorum, and it provides a means of exchanging information between the nodes. Why not use that?
We can detect quorum status the same way the clustercheck scripts do it. To exchange messages, why don’t we simply write to a table? Galera replication will propagate the update to the other nodes. I went ahead and wrote a bash script that is called by cron every minute. The script monitors the node state and the content of the table. If all is right, it updates the table to report its presence (and whether it is acting as a slave). The script then validates the presence of a slave in the cluster. If no reporting slave is found, the script proceeds to the “election” of a new slave, based on the wsrep_local_index value. Basically, the script is a big bunch of “if” statements. The script is here, and the basic documentation on how to set it up here.
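To make the decision flow a bit more concrete, here is a minimal Python sketch of that “big bunch of if statements”. It is not the actual bash script. The names (`NodeReport`, `decide`, `HEARTBEAT_TIMEOUT`), the 180-second freshness window, and the rule that the lowest wsrep_local_index wins the election are all assumptions made for illustration. The health test (wsrep_cluster_status is Primary and wsrep_local_state is 4, meaning Synced) is also my assumption about what “all is right” means.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical freshness window: a report older than this is ignored.
HEARTBEAT_TIMEOUT = 180  # seconds


@dataclass
class NodeReport:
    """One row of the (hypothetical) shared status table."""
    local_index: int   # the reporting node's wsrep_local_index
    is_slave: bool     # True if that node says it is the replication slave
    last_seen: float   # epoch seconds of the node's last report


def node_is_healthy(status: Dict[str, str]) -> bool:
    """Assumed health test: node is in the primary component and Synced."""
    return (status.get("wsrep_cluster_status") == "Primary"
            and status.get("wsrep_local_state") == "4")


def decide(me: str, status: Dict[str, str],
           reports: Dict[str, NodeReport], now: float) -> str:
    """Return the action this node should take on the current cron run."""
    if not node_is_healthy(status):
        return "standby"  # not part of a healthy quorum: do nothing

    # Keep only the nodes that reported recently.
    fresh = {name: r for name, r in reports.items()
             if now - r.last_seen <= HEARTBEAT_TIMEOUT}

    if any(r.is_slave for r in fresh.values()):
        return "report"  # a live slave exists: just report our presence

    # No live slave: elect one among the fresh reporters plus this node.
    # Assumption: the lowest wsrep_local_index wins.
    candidates = {name: r.local_index for name, r in fresh.items()}
    candidates[me] = int(status["wsrep_local_index"])
    winner = min(candidates, key=candidates.get)
    return "become_slave" if winner == me else "report"
```

In a real deployment, `status` would come from `SHOW GLOBAL STATUS LIKE 'wsrep_%'` and `reports` from the replicated table. Keeping the decision itself free of database calls makes it easy to test the failure scenarios one by one.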
Of course, if it works for one cluster, it can work for two. I have configured my customer’s two Percona XtraDB Clusters in a master-to-master relationship using this script. I ran through a bunch of failure scenarios, and the script survived all of them! But of course, this is new. If you are going to implement this solution, run your own set of tests! If you find any problem, file an issue on GitHub. I’ll be happy to fix it!