RabbitMQ Disaster Recovery (DR) Strategy

Nilay Tiwari
5 min read · Jan 13, 2021


Disaster Recovery

Disaster recovery calls for asynchronous replication of data between data centers. RabbitMQ does not currently support this feature, but other message routing features can be leveraged to offer a partial solution.

Two things need to be replicated between data centers in RabbitMQ. First, the schema: exchanges, queues, bindings, users, permissions, policies, and so on. Second, the data in the queues.

RabbitMQ does support replicating the schema to a secondary cluster so that it matches the primary, but without the data. A failover using only schema synchronization provides availability, but with a data loss window.

What is a data loss window?

Nothing is perfect in the real world; anything can fail. If disaster strikes a data center, RTO and RPO come into the picture as part of the SLA. First, let's understand what RPO and RTO are.

SLA of Disaster recovery

RTO refers to the Recovery Time Objective: how much time you will take to come back to a normal state if disaster strikes.

RPO refers to the Recovery Point Objective: the point in time in the past to which you will recover. Whatever happened between the RPO and the disaster will likely be lost. Setting an RPO of zero is not realistic in a multi-DC setup. For example, with a 15-minute RPO, a disaster at 10:00 means you recover to the 09:45 state and the last 15 minutes of messages are lost.

Those are the disaster recovery basics. Now let's see how we can implement replication in RabbitMQ.

How to Replicate Schema

RabbitMQ has two options for schema replication.

  1. Export and import of schema.
  2. Tanzu RabbitMQ.

Definitions export/import is a feature of the management plugin that allows the schema to be exported and imported as JSON files. You can find the options shown below in the Overview section of the RabbitMQ management console.

Management console
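
The same export and import can also be scripted from the command line. The following is a minimal sketch, assuming RabbitMQ 3.8.2 or later (where rabbitmqctl gained these commands); the file path and credentials are placeholders:

# Export the schema (exchanges, queues, bindings, users, policies) to JSON
rabbitmqctl export_definitions /tmp/definitions.json

# Import the definitions on the secondary cluster
rabbitmqctl import_definitions /tmp/definitions.json

# Alternatively, fetch the definitions over the management plugin's HTTP API
curl -u guest:guest http://localhost:15672/api/definitions -o /tmp/definitions.json

Running the export on a schedule and importing it on the secondary cluster keeps the standby schema reasonably current, at the cost of a small synchronization lag.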

Tanzu RabbitMQ offers the Schema Sync Plugin, which actively replicates schema changes to a secondary cluster. This is a commercial offering.

How to Replicate Data in a Multi-DC Setup

There are two approaches, both of which replicate data asynchronously across clusters. They are meant for moving messages between clusters, not for mirroring a cluster's data for redundancy:

  1. Shovel Plugin
  2. Federation Plugin (not covered in this article)

Shovel Plugin

Dynamic shovels are configured using runtime parameters. They can be started and stopped at any time, including programmatically. Dynamic shovels can be used for both transient (one-off) and permanently running workloads.

Static shovels are configured using the advanced configuration file. They are started on node boot and are primarily useful for permanently running workloads. Any change to a static shovel's configuration requires a node restart, which makes them highly inflexible.
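
For illustration, a static shovel entry in advanced.config looks roughly like the following. This is a sketch following the documented Erlang-term format; the shovel name, URIs, and queue names are placeholders:

[
 {rabbitmq_shovel,
  [{shovels,
    [{dr_static_shovel,
      [{source,      [{protocol, amqp091},
                      {uris, ["amqp://user:password@active-dc-host:5672/vhost"]},
                      {queue, <<"orders">>}]},
       {destination, [{protocol, amqp091},
                      {uris, ["amqp://user:password@passive-dc-host:5672/vhost"]},
                      {queue, <<"orders">>}]},
       {ack_mode, on_confirm}]}]}]}
].

Because of the restart requirement, the rest of this article uses dynamic shovels.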

The following commands enable the shovel and shovel management plugins:

rabbitmq-plugins enable rabbitmq_shovel
rabbitmq-plugins enable rabbitmq_shovel_management

Once enabled, you can log in to the RabbitMQ management console and find them in the Admin section.

Enabled Shovel Plugin

You can also use the shovel plugin to replicate data between different virtual hosts. The example below uses different virtual hosts instead of different data centers.

Create your first dynamic shovel using the UI.

Add a shovel

You need to provide the source and destination details on the screen above, and you need to know the URI syntax to fill in the host details in the URI field.

A URI example for the shovel source and destination is given below:

 amqp://username:password@host:port/vhost

Provide the source queue or exchange name, the destination queue or exchange name, and the other configuration, and the shovel will replicate data to the other data center. For security and encryption of data in transit, you can implement mutual TLS using URI parameters; refer to RabbitMQ's URI query parameters documentation for the TLS parameters.
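
For instance, a source or destination URI using TLS with client certificates might look like this. This is a sketch only, with illustrative certificate paths, based on RabbitMQ's documented URI query parameters:

 amqps://username:password@host:5671/vhost?cacertfile=/path/to/ca.pem&certfile=/path/to/client_cert.pem&keyfile=/path/to/client_key.pem&verify=verify_peer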

Once your shovel is created, you can check it in the Shovel Status section of the RabbitMQ management console.

Shovel Status

You can also create a shovel using the command line; this article does not cover it in detail.

rabbitmqctl set_parameter shovel my-shovel
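
For reference, a complete invocation passes the shovel definition as a JSON value. A minimal sketch; the URIs and queue names are placeholders, assuming AMQP 0-9-1 on both ends:

rabbitmqctl set_parameter shovel my-shovel \
  '{"src-protocol": "amqp091", "src-uri": "amqp://user:password@active-dc-host:5672/vhost", "src-queue": "orders",
    "dest-protocol": "amqp091", "dest-uri": "amqp://user:password@passive-dc-host:5672/vhost", "dest-queue": "orders"}'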

A known issue with data replication across data centers

Problem Statement

The active data center has live consumers on RabbitMQ that consume messages from its queues. This is not the case in the passive data center: no consumers run there, and data only arrives via the shovel plugin. As a result, there is a strong possibility of duplicate message processing in the passive data center once disaster strikes.

Problem Statement

You can see clearly in the image above that in state 1 messages are replicated to the other DC, and in state 2 messages are consumed from the active DC but not from the passive DC.

Solution

To delete messages from the passive data center, you can use the following approaches.

  1. Delete by TTL
  2. Add a duplicate-message check at the consumer level.

I recommend using both. There are use cases where either one alone may be preferable, but in the case I faced, I chose to use both. Deleting messages by TTL reduces the message count in the queues: if disaster strikes after two months, the queues would otherwise hold a huge backlog. Implementing a TTL therefore deletes messages from the queues at regular intervals.
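
For the TTL approach, a per-queue message TTL can be applied on the passive cluster with a policy. A minimal sketch; the policy name, queue pattern, vhost, and the 24-hour TTL are illustrative:

# Expire messages after 24 hours (TTL is in milliseconds) on all queues in the vhost
rabbitmqctl set_policy -p vhost dr-ttl ".*" '{"message-ttl": 86400000}' --apply-to queues

For the consumer-level check, the usual pattern is to record each processed message ID in a shared store (for example, a database or Redis) and skip any message whose ID has already been seen.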

Thanks for reading. I welcome your suggestions and improvements.
