Synchronous Dual Data Center (SDDC)

This article describes the Exasol synchronous dual data center (SDDC) solution in more detail.

Introduction

A synchronous dual data center (SDDC) solution has one cluster stretched across two separate sites (data centers). Each data center hosts a database instance with the same number of nodes. Since both instances share the same data volume, only one of them can run at a time. The production database normally runs on the primary site, while data blocks in the master segments there are mirrored to redundancy segments on the secondary site.

This solution provides business continuity with minimum downtime in case of either a node failure or a complete outage on the primary site. Since communication only happens on the storage layer during normal operation, there is no risk of performance or syncing issues due to network latency.

In the following example, data center 1 (DC 1) is the main site and data center 2 (DC 2) is the secondary site. The production database instance PROD runs on the two active nodes n11 and n12, and has one reserve node, n13. The secondary database instance PROD_DR has two active nodes, n14 and n15, and one reserve node, n16. When the cluster is operating normally, the database is only running on the nodes in DC 1 while the nodes in DC 2 are offline.

In the storage layer, the active nodes in DC 1 operate on the corresponding local master segments, which are mirrored to redundancy segments in DC 2.
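
The example layout can be summarized as follows. This is an illustrative sketch only; the node names come from the example above, while the dictionary structure and field names are hypothetical and not part of any Exasol API.

    # Illustrative model of the example SDDC layout (hypothetical structure,
    # not an Exasol API). Each active node in DC 1 holds a master segment
    # that is mirrored to a redundancy segment on the matching node in DC 2.
    sddc_layout = {
        "DC1": {
            "database": "PROD",
            "active_nodes": ["n11", "n12"],
            "reserve_nodes": ["n13"],
            "license_server": True,
        },
        "DC2": {
            "database": "PROD_DR",
            "active_nodes": ["n14", "n15"],
            "reserve_nodes": ["n16"],
            "license_server": False,
        },
    }

    # Master segments in DC 1 and the DC 2 nodes holding their mirrors.
    segment_mirrors = {"n11": "n14", "n12": "n15"}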

The license server runs in DC 1 while DC 1 is the primary site. DC 1 therefore has quorum (a majority of nodes) if a network outage interrupts the connection between the two sites.
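
As a rough sketch of the quorum arithmetic in this example (a simplification, not the exact cluster membership algorithm): counting the six data nodes plus the license server as voters, DC 1 holds four of seven votes while it hosts the license server, so it keeps the majority if the inter-site link fails.

    # Simplified quorum arithmetic for the example cluster (illustrative only;
    # the actual cluster membership protocol is not reproduced here).
    dc1_votes = 3 + 1   # n11, n12, n13 plus the license server
    dc2_votes = 3       # n14, n15, n16
    total_votes = dc1_votes + dc2_votes

    def has_quorum(votes, total):
        """A site keeps quorum when it holds a strict majority of votes."""
        return votes > total / 2

    print(has_quorum(dc1_votes, total_votes))  # True  -> DC 1 keeps running
    print(has_quorum(dc2_votes, total_votes))  # False -> DC 2 stays offline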

Figure: SDDC normal operation

Site failure scenario

If DC 1 has an outage, the license server and the PROD_DR nodes n14 and n15 in DC 2 are started. The PROD_DR nodes operate on their corresponding local segments, which are now deputy segments for n11 and n12. This failover method causes zero data loss, and the database downtime is typically less than a minute.

Automatic failover on site failure is not provided by Exasol and must be set up separately.
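
Because automatic failover is not part of the product, any site failover automation is an external script or tool. The skeleton below only illustrates the decision flow described above; check_site_reachable, start_license_server, and start_database are hypothetical placeholders, not Exasol commands, and would have to be implemented against your own monitoring and administration interfaces.

    import time

    # Hypothetical external failover watchdog (sketch only).
    def check_site_reachable(site: str) -> bool:
        """Return True if the site answers health checks (implementation-specific)."""
        raise NotImplementedError

    def start_license_server(site: str) -> None:
        """Bring up the license server on the given site (implementation-specific)."""
        raise NotImplementedError

    def start_database(instance: str, site: str) -> None:
        """Start the given database instance on the given site (implementation-specific)."""
        raise NotImplementedError

    def watchdog(poll_seconds: int = 10, failures_required: int = 3) -> None:
        """Fail over to DC 2 only after several consecutive failed checks of DC 1."""
        failures = 0
        while True:
            failures = 0 if check_site_reachable("DC1") else failures + 1
            if failures >= failures_required:
                start_license_server("DC2")
                start_database("PROD_DR", "DC2")  # n14 and n15 take over on the deputy segments
                break
            time.sleep(poll_seconds)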

Figure: SDDC site failure

When DC 1 comes back online, PROD_DR operates on the n11 and n12 master segments in DC 1. However, these segments are now stale because of the changes made on the deputy segments in DC 2 during the outage, so they are resynced.

Figure: SDDC site recovered after outage

Since the database is now operating across both sites, there is a risk of performance and syncing issues due to network latency. Restart the production database as soon as possible when the failure has been resolved.

Node failure scenarios

If one of the active nodes in DC 1 fails, the database is automatically restarted. The reserve node n13 is activated and immediately operates on the corresponding redundancy segment in DC 2. This is essentially the same behavior as in a normal hot standby failover, except that the redundancy segment is located on the secondary site.

What happens next depends on whether the node comes back online within the restore delay threshold (transient failure) or not (persistent failure). The default restore delay threshold value is 10 minutes. To learn how to change the restore delay threshold, see Edit a Database in the Administration section for your installation platform.
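
The effect of the restore delay threshold can be sketched as a simple decision: if the failed node returns within the threshold, only the stale blocks are resynced; otherwise the master segment is recreated from its mirror. The function below is an illustration of that rule, not Exasol code.

    # Illustration of the restore delay decision (not Exasol code).
    RESTORE_DELAY_MINUTES = 10  # default threshold

    def failure_handling(outage_minutes: float) -> str:
        if outage_minutes <= RESTORE_DELAY_MINUTES:
            # Transient failure: resync the stale master segment (copy-on-demand).
            return "resync stale master segment from redundancy segment"
        # Persistent failure: rebuild the master segment from its mirror in DC 2.
        return "recreate master segment from redundancy segment"

    print(failure_handling(4))    # transient
    print(failure_handling(25))   # persistent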

Transient node failure: segments on reactivated node are stale and resynced

If the failed node is back online within the restore delay threshold, no data needs to be copied over the network. However, the master segment in DC 1 is now stale and must be resynced from the redundancy segment in DC 2 (copy-on-demand).

Persistent node failure: master segment is recreated

If the failed node is not back online within the restore delay threshold, the failure is considered persistent. The master segment in DC 1 is then recreated from the redundancy segment in DC 2.

This operation can be very time-consuming, depending on the amount of data that must be transferred.
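
As a rough back-of-the-envelope example (illustrative numbers, not a guarantee): recreating a 2 TB segment over a dedicated 10 Gbit/s inter-site link takes on the order of half an hour at full line rate, and proportionally longer on slower or shared links.

    # Rough transfer-time estimate for recreating a segment (illustrative only;
    # real throughput depends on link utilization, latency, and disk speed).
    segment_size_tb = 2.0
    link_gbit_per_s = 10.0

    seconds = (segment_size_tb * 8 * 1000) / link_gbit_per_s  # TB -> Gbit, divided by Gbit/s
    print(f"{seconds / 60:.0f} minutes")  # ~27 minutes at full line rate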

Figure: SDDC node failure

The instance on node n13 will continue to operate on the redundancy segment in DC 2 until the database is restarted.

Since the database is now operating across both sites, there is a risk of performance and syncing issues due to network latency. Restart the production database as soon as possible when the failure has been resolved.