Fail Safety (On-Prem)

A node in a cluster becomes unavailable if the hardware component or server where it is running fails. The fail safety process will indicate that a specific node within a cluster is no longer available, and the cluster nodes will replicate data to neighbor nodes if redundancy is configured.

When Exasol is installed on physical hardware, you can use the hot standby mechanism for failover. With hot standby, you have one or more active reserve nodes for the active nodes in your system. In case of a failure, the reserve node immediately takes over for the failed node.

The most important objective is data integrity. The failure of a hardware component does not cause data loss or data corruption. The hard drives of the cluster nodes are configured in RAID 1 pairs to compensate for single-disk failures without any interruption. Additionally, the cluster nodes replicate data to neighbor nodes if redundancy 2 volumes are used, which is a best practice.

To achieve fast recovery of operation, the cluster operating system automatically restarts the necessary services if the corresponding resources (main memory, number of nodes, …) are available.

Exasol 4+1 Cluster: Redundancy 2

If volumes are configured with redundancy 2, each node holds a mirror of the data that a neighbor node operates on. For example, if n11 modifies A, the mirror A‘ on n12 is synchronized over the private network. Should an active node fail, the reserve node steps in and starts an instance.
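The neighbor mirroring described above can be sketched as a simple mapping. This is a minimal illustration, not Exasol code: the function name `mirror_layout` is hypothetical, and the wrap-around from the last active node back to the first is an assumption. The text only confirms that data on n11 is mirrored on n12 and data on n12 is mirrored on n13.

```python
def mirror_layout(active_nodes):
    """Map each active node to the neighbor assumed to hold its mirror."""
    count = len(active_nodes)
    # Assumption: the mirror lives on the next node, wrapping around at the end.
    return {
        node: active_nodes[(index + 1) % count]
        for index, node in enumerate(active_nodes)
    }


# The reserve node n15 is not part of the layout while all active nodes are up.
active = ["n11", "n12", "n13", "n14"]
layout = mirror_layout(active)
for master_node, mirror_node in layout.items():
    print(f"data operated on by {master_node} is mirrored on {mirror_node}")
```

With this layout, every active node holds one set of data it operates on and one mirror belonging to a neighbor, which is why a single node failure leaves a usable copy of every segment in the cluster.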

What Happens on Node Failure

A node failure leads to the following sequence of actions within the Exasol Cluster:

  • Within approximately five seconds, EXACluster OS detects that a node has failed and stops all affected databases on the cluster.
  • In the following two seconds (approximately), EXACluster OS activates a reserve node and restarts the databases.
  • In the following eight seconds (approximately), end users can connect to these databases again.
  • Over the following minutes, a background restore of segments to the newly active node runs while the databases are up.
  • Overall, it takes only a couple of seconds for the database to become available again.

The timings above are given for an average-sized cluster and should not be regarded as upper limits or as precise values. Actual timings vary depending on the number of nodes in your cluster and the database load.
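To make the sequence easier to follow, the approximate phase durations from the list can be added up into a cumulative timeline. This is only an arithmetic sketch using the figures quoted above; the names `PHASES` and `failover_timeline` are illustrative, and the resulting values are no more precise than the approximations they are built from.

```python
# Approximate phase durations in seconds, taken from the list in the text.
PHASES = [
    ("failure detected, affected databases stopped", 5),
    ("reserve node activated, databases restarted", 2),
    ("databases accept connections again", 8),
]


def failover_timeline(phases):
    """Return (description, elapsed_seconds) pairs with cumulative times."""
    elapsed = 0
    timeline = []
    for description, duration in phases:
        elapsed += duration
        timeline.append((description, elapsed))
    return timeline


timeline = failover_timeline(PHASES)
for description, elapsed in timeline:
    print(f"t+{elapsed:>2}s  {description}")

# Time from the failure until users can reconnect, using the quoted figures.
total_seconds = timeline[-1][1]
```

The background restore of segments is deliberately left out of the timeline because it runs for minutes while the databases are already available.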

Exasol 4+1 Cluster: Persistent node failure

If the failed node n12 does not become available again before the Restore Delay threshold (default: 10 minutes) has elapsed, the segments that resided on the failed node (A‘ and B) are copied to the newly activated node (the former reserve node n15) using the mirrors B‘ on n13 and A on n11. This is a time-consuming activity that puts a significant load on the private network. If the private network has been separated into a database network and a storage network, the copying is done over the storage network.
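The choice of restore sources can be illustrated with a small sketch. It assumes the neighbor layout of the 4+1 example, where each node holds its own master segment plus the mirror of the preceding node's master. The segment names C and D, the wrap-around between n14 and n11, and the function `restore_plan` are hypothetical; the text only names the segments A and B.

```python
ACTIVE = ["n11", "n12", "n13", "n14"]
# Master segment per node; C and D are assumed names for illustration.
SEGMENTS = dict(zip(ACTIVE, "ABCD"))


def restore_plan(failed_node, active=ACTIVE, segments=SEGMENTS):
    """List (segment_to_rebuild, source_node) pairs for a failed node."""
    index = active.index(failed_node)
    previous_node = active[index - 1]               # its master was mirrored on the failed node
    next_node = active[(index + 1) % len(active)]   # holds the mirror of the failed node's master
    return [
        (segments[previous_node] + "'", previous_node),  # lost mirror, rebuilt from the master
        (segments[failed_node], next_node),              # lost master, rebuilt from its mirror
    ]


plan = restore_plan("n12")
for segment, source in plan:
    print(f"restore {segment} to the former reserve node from {source}")
```

For n12 this reproduces the case in the text: the mirror of A is rebuilt from n11, and B is rebuilt from its mirror on n13.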

It is recommended to add a new reserve node in this scenario to replace the crashed node n12.

Exasol 4+1 Cluster: Transient node failure

If the failed node n12 comes back within the Restore Delay, the segments on that node are stale because their mirrors have been operated on in the meantime. They have to be re-synchronized before they can be used again. Nevertheless, this scenario does not require a complete restore of the mirrors to n15.
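The distinction between the two scenarios comes down to comparing the node's downtime with the Restore Delay. The following sketch shows that comparison; `classify_node_failure` is a hypothetical helper, not part of any Exasol interface, and treating a node that returns exactly at the threshold as a persistent failure is an assumption.

```python
from datetime import timedelta

# Default Restore Delay stated in the text; the configured value may differ.
DEFAULT_RESTORE_DELAY = timedelta(minutes=10)


def classify_node_failure(downtime, restore_delay=DEFAULT_RESTORE_DELAY):
    """Classify a failure by comparing node downtime with the Restore Delay."""
    # Assumption: returning exactly at the threshold counts as persistent.
    if downtime < restore_delay:
        return "transient"   # stale segments only need a re-sync
    return "persistent"      # segments are fully restored to the former reserve node


print(classify_node_failure(timedelta(minutes=3)))
print(classify_node_failure(timedelta(minutes=25)))
```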

Fast mirror re-sync

After n12 comes back, the stale segments are re-synchronized by applying the changes that were made to A and B‘ while n12 was offline. This activity is much faster and less load-intensive than the complete restore of these segments to n15 that is performed for a persistent node failure. The instance on n15 now works on the master segment B residing on n12 until the database is restarted. That restart is not done automatically, to avoid the short downtime associated with it.

Payload of database node x resides on volume master node y

This is how the above situation is seen in EXAoperation.

Database restart

Restarting the database after a transient node failure re-establishes the initial scenario with n11-n14 as active nodes and n15 as the reserve node. The drawback is that this causes a short period of downtime for the database. Another option is the Move Node operation.

Move node

As an alternative to restarting the database, a Move Node operation can be performed without causing downtime. In this case, the segments residing on n12 are copied over the private network to n15, which can be time-consuming depending on the affected data volume. The private network can also become heavily utilized during that period.

Select affected volume & node

To move the volume instead of restarting the database, select the affected volume on the EXAStorage page, and then select the node that is presently not used for the volume.

Click the Move Node button.

Select the target node, and then click the Move Node button again.

Monitor Recovery Progress

The ongoing recovery can be monitored on the volume detail page.

Log entries

The completion of the restore is recorded in the log maintained by the logservice.

What Happens on Storage Failure

You can configure the storage with redundancy for a fail-safety mechanism. The following process describes what happens if the storage fails.

The following sections use a storage setup with redundancy 3 as an example.

Initial Situation

In the initial situation, there is one master segment handling all requests and two redundancy segments.

Master Segment fails

When the master segment fails:

  1. The first redundancy segment becomes the deputy.
  2. Operation is redirected automatically.
  3. The application continues to work.

Master Segment is online again

When the master segment comes back online:

  1. It becomes the master segment again.
  2. Read operations trigger a copy on demand.
  3. A background restore is done.
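The role changes in the redundancy 3 example can be modeled with a toy class. This is a simplified sketch under stated assumptions: the class `Volume`, its methods, and the segment labels are hypothetical, and the model only tracks which segment serves requests. It does not model the copy on demand or the background restore that bring the returning master segment up to date.

```python
class Volume:
    """Toy model of one master segment with two redundancy segments."""

    def __init__(self):
        # Priority order: the master first, then the redundancy segments.
        self.segments = ["master", "redundancy-1", "redundancy-2"]
        self.online = {name: True for name in self.segments}

    def serving_segment(self):
        """Return the first online segment in priority order."""
        for name in self.segments:
            if self.online[name]:
                return name
        raise RuntimeError("no segment available")

    def fail(self, name):
        self.online[name] = False

    def recover(self, name):
        self.online[name] = True


volume = Volume()
initial = volume.serving_segment()         # master handles all requests
volume.fail("master")
during_failure = volume.serving_segment()  # first redundancy segment acts as deputy
volume.recover("master")
after_recovery = volume.serving_segment()  # master takes over again
print(initial, during_failure, after_recovery)
```

Because requests always go to the first online segment in priority order, the redirection to the deputy and the return to the master need no separate logic in this model.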