Manage Databases and Clusters
A database comprises one or more clusters, which handle all query operations. A cluster is a group of servers, each with its own main memory (RAM) and CPUs. Every query is executed on a cluster, whose size defines its compute power in terms of vCPUs and main memory. By adding multiple clusters, you can separate workloads between teams or tenants and allow for greater concurrency.
The data and metadata for the database are stored in a central data store, such as an S3 bucket, not in the underlying clusters. All clusters in a database access the same underlying data and metadata. Changes committed on one cluster are persisted to the data store and become visible to all other transactions in the database, regardless of which cluster they are connected to.
The first cluster of a database has a special role and is called the main cluster. All other clusters are called worker clusters. The main cluster communicates with all worker clusters and ensures they have a consistent view of transactions and metadata. The communication between clusters is metadata-only; every cluster has a direct connection to end users and to the central data store. You can connect to the main cluster like any other cluster; however, the main cluster must be running whenever worker clusters are running. Similarly, stopping the main cluster also stops all worker clusters in the database.
The optimal cluster size depends on your usage patterns. As a rule of thumb, start with a RAM size that corresponds to about 10% of your raw data volume. For example, if you want to process about 1 TB of raw data, the cluster size Small may be a good starting point. Depending on your workload, you may need a larger cluster to use additional CPU and memory. Sometimes, a smaller cluster may also meet your requirements and be more cost-effective. You can test which size best fits your needs by dynamically resizing a cluster. For more information, see Resize Cluster.
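As an illustration of the 10% rule of thumb, the sketch below maps a raw data volume to a suggested cluster size. The size names and RAM figures in the table are assumptions for illustration only, not the official Exasol SaaS size specifications:

```python
def suggest_cluster_ram_gb(raw_data_gb: float) -> float:
    """Rule of thumb from the text: cluster RAM should be about 10% of raw data volume."""
    return raw_data_gb * 0.10

# Illustrative size ladder -- the RAM per size is an assumption for this
# sketch, not the official Exasol SaaS specification.
SIZES = [("XS", 64), ("S", 128), ("M", 256), ("L", 512), ("XL", 1024)]

def suggest_size(raw_data_gb: float) -> str:
    """Pick the smallest illustrative size whose RAM covers the 10% rule."""
    needed = suggest_cluster_ram_gb(raw_data_gb)
    for name, ram_gb in SIZES:
        if ram_gb >= needed:
            return name
    return SIZES[-1][0]  # fall back to the largest size
```

With these assumed figures, 1000 GB (about 1 TB) of raw data needs roughly 100 GB of RAM, so the helper would suggest size S, matching the example above. Treat the result as a starting point and validate it against your actual workload by resizing.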
Scalability is the ability to increase and decrease resources based on business demands. Scalability can be classified into vertical and horizontal scaling.
Vertical scaling (scale up or scale down) refers to changing the compute power and RAM to optimize concurrency. When you scale up or down, the number of vCPUs and the amount of RAM allocated to the cluster are adjusted based on the new cluster size. You can use vertical scaling when you want to:
- Speed up the queries you are executing.
- Run large, complex queries or support bigger datasets without impacting performance.
- Add more users or increase concurrency without affecting performance.
For example, consider a business with one database and one cluster. Over time, as the data and the number of users grow, users start to experience slowdowns while working on the database. In this case, resizing the cluster from size M to L would noticeably improve all queries. The administrator has complete control over when and how to scale based on business needs and budget.
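The scale-up decision in this example can be sketched as a single step along a size ladder. The ladder below is an assumed ordering for illustration; check the sizes actually available in your Exasol SaaS account:

```python
# Assumed size ladder for illustration, smallest to largest.
SIZE_LADDER = ["XS", "S", "M", "L", "XL"]

def next_size_up(current: str) -> str:
    """Return the next larger size, staying at the top if already largest."""
    i = SIZE_LADDER.index(current)
    return SIZE_LADDER[min(i + 1, len(SIZE_LADDER) - 1)]
```

Applied to the scenario above, stepping up from "M" yields "L". Resizing one step at a time and re-measuring query performance keeps costs under control while you find the right size.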
Horizontal scaling entails adding more clusters to optimize concurrency and manage higher workloads. You can use horizontal scaling when you want to isolate different workloads.
You can increase concurrency through both horizontal and vertical scaling. If there are no separate workloads with separate groups of users, we recommend starting with vertical scaling. If this is no longer sufficient, or if there are different workloads, we recommend creating dedicated clusters.
A business can have the following two use cases to scale Exasol horizontally:
- A BI team uses Exasol via a BI tool and SQL clients. Automated reports and interactive queries run 24/7. They access almost all of the data, except the clickstream table.
- The data science team wants to analyze large datasets (the clickstream table), which requires a very large cluster with high memory and CPU resources to process queries quickly. However, they do not need the cluster all the time.
Therefore, from a scaling perspective, having two separate clusters for BI and Data Science is more beneficial for the following reasons:
- Clusters will always have the relevant data in memory, leading to optimal performance.
- You can scale the clusters independently as needed for the individual workload.
- The compute resources are isolated, i.e., any work or actions the data science team does will not affect the performance of the BI workload.
- The cluster for the data science team can be stopped when it is not needed, saving costs.
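The separation above can be sketched as a simple workload-to-cluster mapping. The cluster names and connection strings below are placeholders, not real endpoints; in practice you would use each cluster's actual connection string from the web console:

```python
# Hypothetical mapping of workloads to dedicated clusters.
# DSNs are placeholders for illustration only.
CLUSTERS = {
    "bi": {
        "dsn": "bi-cluster.example.exasol.com:8563",
        "always_on": True,   # BI reports and queries run 24/7
    },
    "data_science": {
        "dsn": "ds-cluster.example.exasol.com:8563",
        "always_on": False,  # stopped when not needed to save cost
    },
}

def cluster_for(workload: str) -> str:
    """Return the connection string of the cluster dedicated to a workload."""
    return CLUSTERS[workload]["dsn"]
```

Because each team connects only to its own cluster, the data science team's heavy queries never compete with the BI workload for compute resources, and the data science cluster can be stopped independently.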
Actions on Database and Cluster
The Databases page in the web console provides an overview of your databases and cluster configurations. As an account owner, you can add and manage the databases and clusters from this page. If you are an account member, you can only see the databases to which you have access.
Depending on your role and access level, you can perform the following database and cluster operations:
- Add Cluster
- Start and Stop a Cluster
- Change Auto-Stop Settings
- Rename Cluster
- Resize Cluster
- Delete Cluster
Backup & Restore
Exasol manages the backup and restore mechanisms for you. For more information, see Backup & Restore.