Fail Safety (Cloud)

This article describes how fail safety works in cloud deployments of Exasol.

If the hardware component or server on which a cluster node is running fails, Exasol detects that the node is no longer available and triggers an automatic failover procedure. If the volumes have been configured with redundancy, the metadata on each node is replicated to mirrors on neighboring nodes, which means that data integrity is preserved on node failure.

In case of node failure in a cloud deployment, Exasol will automatically shut down the affected databases, then try to restart the failed node and restore database connectivity. If the failed node cannot be restarted, two slightly different standby failover mechanisms are available for cloud deployments: cold standby and hot standby.

Cold Standby

With this mechanism, Exasol automatically starts a suspended node that takes over for the failed node. Cold standby is slower but more cost-effective than hot standby, since it does not require additional active nodes.

Cold standby is only available on cloud deployments of Exasol.

Hot Standby

With this mechanism you have one or more active reserve nodes in the cluster. In case of node failure, the reserve node immediately takes over for the failed node and data is copied to this node from the mirrors. Hot standby is a relatively fast failover mechanism, but it is expensive since it requires additional active nodes.

Hot standby is primarily intended for on-premises installations of Exasol but can also be used in cloud deployments. The failover process in a cloud deployment is similar but puts less load on the network, since only metadata is copied between nodes when using object storage.

Cold Standby

The cold standby mechanism is designed specifically for deployments of Exasol on a cloud platform. It is slower than the hot standby mechanism but more cost-effective.

Prerequisites

  • The cluster must include at least one standby node in suspended mode.
  • The smallest supported cluster configuration is 3 active nodes + 1 standby node (3+1), because cold standby requires that more than 50% of the active data nodes are available during an outage.
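The majority requirement can be illustrated with a small check. This is a minimal sketch of the rule as stated in the prerequisites above, not Exasol's implementation; the function name `majority_available` and its parameters are hypothetical.

```python
def majority_available(active_nodes: int, failed_nodes: int) -> bool:
    """Return True if more than 50% of the active data nodes are still available.

    Hypothetical helper that only restates the prerequisite as arithmetic.
    """
    if active_nodes <= 0:
        raise ValueError("active_nodes must be positive")
    if not 0 <= failed_nodes <= active_nodes:
        raise ValueError("failed_nodes must be between 0 and active_nodes")
    available = active_nodes - failed_nodes
    # Strictly more than half: exactly 50% is not enough.
    return 2 * available > active_nodes
```

With a single failed node, `majority_available(2, 1)` is false because exactly 50% of the nodes remain, while `majority_available(3, 1)` is true. This is consistent with 3+1 being the smallest supported configuration.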

Previous versions of Exasol required the Exasol cloud failover plug-in for cold standby. This functionality is now built into Exasol.

What happens on node failure

A node failure triggers the following automatic sequence of actions:

  1. All affected databases on the cluster are stopped.
  2. The failover mechanism tries to restart the failed node and bring it back online, then restarts the databases.
  3. If the failed node could not be restarted, the failover mechanism starts the standby node and brings it online.
  4. Database connectivity is restored.
  5. A background restore of data segments from mirrors to the new active node is started.
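The sequence above can be sketched as a simple decision flow. This is an illustrative model only: the function name and the step labels are hypothetical, and the sketch assumes that the background restore is only needed when the standby node takes over.

```python
from typing import Callable, List


def cold_standby_sequence(restart_failed_node: Callable[[], bool]) -> List[str]:
    """Return the ordered steps of a cold standby failover (illustrative model).

    restart_failed_node is a hypothetical callback that returns True if the
    failed node could be brought back online.
    """
    steps = ["stop affected databases", "try to restart failed node"]
    if restart_failed_node():
        # The failed node is back, so no standby node is needed.
        steps.append("restart databases")
        steps.append("restore database connectivity")
    else:
        # Fall back to the suspended standby node.
        steps.append("start standby node")
        steps.append("restore database connectivity")
        steps.append("start background restore from mirrors")
    return steps
```

Calling `cold_standby_sequence(lambda: False)` returns the path in which the standby node is started and the background restore follows.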

Hot Standby

Since cloud deployments use object storage, only metadata is stored on the nodes. This means that far less data is transferred over the network than when using block storage.

Example: 4+1 cluster with redundancy level 2

The following example shows an Exasol cluster with 4 active data nodes and one reserve node (4+1). The cluster also has an access node, which contains no data and is not considered by the failover mechanism.

The volumes are configured with redundancy level 2, which means that each node holds a mirror of the data segments that a neighbor node operates on. For example, if node n11 modifies segment A, the mirror A' on the neighbor node n12 is synchronized over the private network.
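The mirror placement described above can be modeled as a mapping from each node to the neighbor that holds its mirror. This is a minimal sketch: the function name `mirror_layout` is hypothetical, and the wrap-around from the last node back to the first is an assumption, since the text only states the n11 to n12 case.

```python
from typing import Dict, List


def mirror_layout(nodes: List[int]) -> Dict[int, int]:
    """Map each active node to the neighbor node holding the mirror of its segment.

    Assumption: mirrors are placed on the next node in the list, and the
    last node wraps around to the first.
    """
    return {node: nodes[(i + 1) % len(nodes)] for i, node in enumerate(nodes)}
```

For the active nodes of the 4+1 example, `mirror_layout([11, 12, 13, 14])` places the mirror of n11's segment on n12, as in the text.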

[Figure: fail safety model 1]

What happens on node failure

A node failure triggers the following automatic sequence of actions by EXACluster OS:

  1. The node failure is detected.
  2. All affected databases on the cluster are stopped.
  3. The reserve node is activated.
  4. All databases are restarted.
  5. Database connectivity is restored.

What happens next depends on whether the failed node comes back online within a specific timeout period (transient node failure) or if it does not come back (persistent node failure). In case of a transient node failure, a complete restore of data segments from mirrors to the new active node will not be required.
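The distinction between the two cases can be expressed as a comparison against the timeout. This is a minimal sketch, not the cluster's actual logic: the names are hypothetical, the 10-minute default is the value stated in this article, and treating a downtime exactly equal to the timeout as persistent is an assumption.

```python
from datetime import timedelta
from typing import Optional

# Default timeout threshold as stated in this article.
DEFAULT_TIMEOUT = timedelta(minutes=10)


def classify_node_failure(
    downtime: Optional[timedelta], timeout: timedelta = DEFAULT_TIMEOUT
) -> str:
    """Classify a node failure as 'transient' or 'persistent'.

    downtime is how long the node was offline, or None if it never came back.
    """
    if downtime is not None and downtime < timeout:
        return "transient"
    return "persistent"
```

For example, a node that returns after 3 minutes is classified as transient, while a node that never returns (`None`) is classified as persistent.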

Persistent node failure

If the failed node n12 does not become available again before the timeout threshold is reached (the default value is 10 minutes), the segments A' and B that resided on the failed node are automatically copied to n15, the former reserve node, using their respective counterparts A on n11 and B' on n13.

[Figure: persistent failure]

In case of a persistent node failure we recommend that you add a new reserve node to replace the failed node.
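The copy operations for a persistent failure can be derived from the mirror layout. This is an illustrative sketch under stated assumptions: the function name and the dictionary keys are hypothetical, each node is assumed to own one segment whose mirror sits on the next node, and the last node is assumed to wrap around to the first.

```python
from typing import Dict, List


def restore_plan(
    nodes: List[int], segments: List[str], failed: int, reserve: int
) -> List[Dict[str, object]]:
    """List the copies needed to rebuild a failed node's segments on the reserve node.

    segments[i] is the segment owned by nodes[i]; its mirror (marked with ')
    is assumed to live on the next node in the list.
    """
    i = nodes.index(failed)
    prev_i = (i - 1) % len(nodes)
    next_i = (i + 1) % len(nodes)
    return [
        # The failed node's own segment is rebuilt from its mirror on the next node.
        {"restore": segments[i], "source": segments[i] + "'",
         "copy_from": nodes[next_i], "copy_to": reserve},
        # The mirror it held is rebuilt from the previous node's segment.
        {"restore": segments[prev_i] + "'", "source": segments[prev_i],
         "copy_from": nodes[prev_i], "copy_to": reserve},
    ]
```

For the example cluster, `restore_plan([11, 12, 13, 14], ["A", "B", "C", "D"], failed=12, reserve=15)` yields a copy of B from B' on n13 and a copy of A' from A on n11, both to n15, matching the description above.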

Transient node failure

If the failed node n12 comes back before the timeout threshold is reached, a complete restore of the mirrors to node n15 is not necessary, and segments are therefore not copied between the nodes. However, the segments A' and B on node n12 are now stale, since their counterparts on nodes n11 and n13 have been operated on in the meantime. These segments therefore have to be re-synchronized before they can be used again.

[Figure: transient failure]

Fast mirror re-sync

As soon as node n12 is back online, the cluster operating system automatically re-syncs the stale segments by applying the changes that were made to their counterparts A and B' while n12 was offline.

[Figure: fast mirror resync]

The instance on node n15 now works on the segments residing on node n12, until the database is restarted.

To avoid downtime, the database is not automatically restarted.

Restart the database

Restarting the database after a transient node failure will re-establish the initial scenario with n11 to n14 as active nodes and n15 as reserve node.

[Figure: fail safety model 1]

Move data between nodes

Since restarting the database requires additional downtime, an alternative method that will not cause any downtime is to move data between the nodes using the ConfD job st_volume_move_node. For example:

confd_client st_volume_move_node vname: DATA_VOLUME_1 src_nodes: '[12]' dst_nodes: '[15]'

In this case the segments residing on node n12 are moved to node n15. The data is moved over the private network.
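If the command is generated from a script, it can be assembled with proper shell quoting. This is a minimal sketch: the helper `build_move_command` is hypothetical, only the job name and the parameters `vname`, `src_nodes`, and `dst_nodes` are taken from the example above, and the comma-separated format for lists with more than one node is an assumption.

```python
import shlex
from typing import List


def build_move_command(volume: str, src_nodes: List[int], dst_nodes: List[int]) -> str:
    """Build a shell-quoted confd_client command line for st_volume_move_node."""

    def node_list(nodes: List[int]) -> str:
        # Assumed list syntax: [12] for one node, [12,13] for several.
        return "[" + ",".join(str(n) for n in nodes) + "]"

    args = [
        "confd_client", "st_volume_move_node",
        "vname:", volume,
        "src_nodes:", node_list(src_nodes),
        "dst_nodes:", node_list(dst_nodes),
    ]
    # shlex.join quotes arguments that contain shell metacharacters such as [ and ].
    return shlex.join(args)
```

Calling `build_move_command("DATA_VOLUME_1", [12], [15])` reproduces the command line shown above.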

Verify where the database node payload resides

To find out on which volume master node the payload of a database node resides, use the ConfD job db_info:

confd_client db_info db_name: MY_DATABASE
...
info: Payload of database node 12 resides on volume master node 11.
...
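The payload locations can be extracted from the job output with a regular expression. This is a minimal sketch that assumes the info lines have exactly the wording shown in the excerpt above; the real output may differ, and the function name is hypothetical.

```python
import re
from typing import Dict

# Pattern based solely on the info line shown in the excerpt above.
PAYLOAD_RE = re.compile(
    r"Payload of database node (\d+) resides on volume master node (\d+)"
)


def parse_payload_locations(output: str) -> Dict[int, int]:
    """Map each database node to the volume master node holding its payload."""
    return {int(db): int(vol) for db, vol in PAYLOAD_RE.findall(output)}
```

Applied to the excerpt above, the function returns a mapping from database node 12 to volume master node 11.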