SDDC: Monitoring
Learn how to monitor an SDDC solution.
The documentation in this section is intended for advanced users who are already familiar with how to install and administer Exasol databases using ConfD and Exasol Deployment Tool (c4).
Introduction
To ensure that an SDDC solution is functioning properly, it is important to monitor the status of the volumes. As long as all volumes have their redundancy in place, every commit in the database is automatically written to the redundant node as well, which ensures that there is no potential for data loss in that respect.
However, writing to the redundant copy requires that all volumes are online and operational. If all volumes are in the ONLINE state, the SDDC is functioning properly and the cluster can handle a disaster scenario by swapping to the passive data center. In any other state (DEGRADED or RECOVERING), there is no guarantee that the cluster can handle a disaster scenario on either the passive or active side. Whether recovery is possible then depends on which nodes crashed and whether full redundancy is in place for those nodes.
Learn how to use various tools to ensure the following:
-
All volumes are in an ONLINE state.
-
The data and archive volumes are using the same nodes.
This is recommended to avoid issues in case of a future node failure, which could otherwise leave the data and archive volumes in different states. For example, if a reserve node is taken offline for maintenance while it is still in use by the archive volume, the archive volume would be DEGRADED while the data volume remained ONLINE.
-
All redundant segments are on nodes in DC 2.
-
No segments are on a reserve node.
Having segments on a reserve node causes degraded performance.
Monitoring volume states
To monitor the state of the volumes, use the ConfD job st_volume_info (use jq to filter the output).
Example:
confd_client st_volume_info vname: data_vol --json | jq -r '.state'
...
confd_client st_volume_info vname: arc_vol --json | jq -r '.state'
The desired result is ONLINE. Undesired results are DEGRADED, RECOVERING, or LOCKED.
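For scripting or alerting, the state check above can be wrapped in a small helper. This is a minimal sketch; the commented confd_client call assumes the volume name data_vol from the examples above:

```shell
# check_state STATE: succeeds only when STATE is the desired ONLINE value
check_state() {
    case "$1" in
        ONLINE) return 0 ;;
        DEGRADED|RECOVERING|LOCKED) return 1 ;;
        *) return 2 ;;  # unexpected value, e.g. empty output
    esac
}

# Usage against a live cluster (assumed environment):
# state=$(confd_client st_volume_info vname: data_vol --json | jq -r '.state')
# check_state "$state" || echo "ALERT: data_vol is ${state:-unknown}"
```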
Monitoring database info
To monitor the database, use the ConfD job db_info and grep for info. In normal operation, the info field is empty.
Example:
confd_client db_info db_name: PROD | grep info
info: ''
If info is not empty, then you need to take the appropriate action. A message is delivered in the following cases:
-
Redundancy is lost
-
A node is using data from a different node
Example:
confd_client db_info db_name: PROD | grep info
info: 'Payload of database node 22 resides on volume master node 12. 1 segments of the database volume are not online (missing redundancy)'
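For an automated check, you can test whether the grepped info line reports an empty field. This is a minimal sketch, assuming the `info: '...'` line shape shown in the examples above:

```shell
# db_info_ok LINE: succeeds when the grepped info line reports no issues
db_info_ok() {
    [ "$1" = "info: ''" ]
}

# Usage against a live cluster (assumed environment):
# line=$(confd_client db_info db_name: PROD | grep info)
# db_info_ok "$line" || echo "ALERT: $line"
```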
Monitoring segments
In the default state, each volume contains a MASTER segment and a REDUNDANT segment. You can use the following command to identify and parse which segments and redundant copies are stored on which nodes. You should also verify that any reserve nodes are not holding any segments of the data and archive volumes.
csinfo -R
You can also use the ConfD job st_volume_info to identify which nodes are in use by the volume, based on which nodes contain segments for the given volume. The archive and data volumes should be using the same list of nodes, although the order of the nodes presented in the list does not have to be identical.
confd_client st_volume_info vname: data_vol --json | jq -r '.volume_nodes'
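To verify that the data and archive volumes use the same nodes regardless of order, compare the sorted node lists. This is a minimal sketch; the commented confd_client calls assume the volume names data_vol and arc_vol from the earlier examples:

```shell
# same_nodes "LIST_A" "LIST_B": succeeds when both whitespace-separated
# node lists contain the same nodes, ignoring order
same_nodes() {
    a=$(printf '%s\n' $1 | sort)
    b=$(printf '%s\n' $2 | sort)
    [ "$a" = "$b" ]
}

# Usage against a live cluster (assumed environment):
# data=$(confd_client st_volume_info vname: data_vol --json | jq -r '.volume_nodes | join(" ")')
# arc=$(confd_client st_volume_info vname: arc_vol --json | jq -r '.volume_nodes | join(" ")')
# same_nodes "$data" "$arc" || echo "ALERT: data and archive volumes use different nodes"
```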
Monitoring redundancy recovery
If one of the volumes is in the RECOVERING state, then its redundancy is being rebuilt from another node. During recovery, a log message is printed in the Storage logs every 5 minutes with the amount of data remaining and an estimated time to finish (ETF):
logd_collect Storage
For example, after moving a segment to n22, you would see a message like this in the logfile:
Recovery state for node 'n22': 250.60 GiB left | ETF 6m
Additionally, you can use csrec to monitor the recovery of a volume:
csrec -s -v <vol_id>
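If a script needs to wait for a recovery to finish, one approach is to poll the volume state until it returns to ONLINE. This is a minimal sketch; the polling command is passed in as arguments so the helper is not tied to a specific cluster, and POLL_INTERVAL is a hypothetical knob for the sleep between checks:

```shell
# poll_until_online CMD...: runs CMD repeatedly until it prints ONLINE;
# sleeps POLL_INTERVAL seconds (default 60) between attempts, and fails
# if CMD itself fails
poll_until_online() {
    while state=$("$@") && [ "$state" != "ONLINE" ]; do
        echo "volume state: $state, waiting..."
        sleep "${POLL_INTERVAL:-60}"
    done
    [ "$state" = "ONLINE" ]
}

# Usage against a live cluster (assumed environment):
# poll_until_online sh -c "confd_client st_volume_info vname: data_vol --json | jq -r '.state'"
```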