Using this statement you can control the distribution of rows across the cluster server and within a server.
You need to have the system privilege ALTER ANY TABLE, object privilege ALTER on the table or its schema, or the table belongs to the current user or one of the user roles.
- Explicitly distributed/partitioned tables can accelerate certain SQL statement significantly. However, it depends heavily on actual use case and the overall system usage.
The execution time of this command can take some time. You should use it only when you are aware of its consequences. In particular, it can influence the performance of queries negatively.
- When you join two tables, and both the tables are distributed by the join columns, then the join can be computed locally on the server nodes without having to transfer any data across the cluster.
- Executing an aggregation using GROUP BY on a single table will be accelerated if the distribution columns are a subset of the group by columns.
- If a table is explicitly distributed, the distribution of the rows across the cluster nodes is determined by the hash value of the distribution keys. The order of the defined columns is irrelevant. The concrete mapping between column values and database nodes can be simulated with function VALUE2PROC.
- If you define inappropriate distribution keys which would lead to an imbalance of rows in the cluster, the command is aborted, and a corresponding error message is thrown.
- Partitioning of a table can be useful if it is extremely large and does not fit in the main memory anymore.
- You only have to define the columns specifying the partitioning key. The actual partitions are automatically organized by the system.
- In case of a selective filter condition on the partitioning column of a very large table, less data has to be loaded from disk if it does not yet reside in the main memory cache.
- This is applicable to DELETE operations that can be processed more efficiently.
- Partitioning keys may only contain fixed-sized sortable data types: DECIMAL, DATE, TIMESTAMP, DOUBLE, BOOLEAN, INTERVAL.
- The order of the defined columns is relevant.
- If the table distribution/partitioning was not defined explicitly, the system distributes the rows equally across the cluster nodes and within a node by the insert order.
- The explicit distribution/partitioning of a table can also be defined directly in the CREATE TABLE statement. This status is sustained when inserting, deleting or updating rows.
- Using the DROP DISTRIBUTION/PARTITION KEYS statement, you can undo the explicit distribution/partitioning of a table. After that, the rows will again be distributed randomly, but equally across the cluster nodes.
- After setting or dropping the distribution or partitioning keys, all internal indexes on the given table are rebuilt.
- By executing the DESC[RIBE] statement or by using the information in the system table EXA_ALL_COLUMNS, you can identify the distribution and partitioning keys of a table. In system table EXA_ALL_TABLES, you can see whether a table is explicitly distributed or partitioned. The system table EXA_ALL_COLUMNS shows for each column, if it is part of the distribution key and its ordinal position in the case it belongs to the partition key.
CREATE TABLE my_table (order_id INT, shop_id INT, order_item VARCHAR(100), order_price DOUBLE, order_date DATE); ALTER TABLE my_table DISTRIBUTE BY shop_id, PARTITION BY order_date; DESCRIBE my_table; COLUMN_NAME SQL_TYPE NULLABLE DISTRIBUTION_KEY PARTITION_KEY ----------- ------------------- -------- ---------------- ------------- ORDER_ID DECIMAL(18,0) TRUE FALSE FALSE SHOP_ID DECIMAL(18,0) TRUE TRUE FALSE ORDER_ITEM VARCHAR(100) UTF-8 TRUE FALSE FALSE ORDER_PRICE DOUBLE TRUE FALSE FALSE ORDER_DATE DATE TRUE FALSE TRUE ALTER TABLE my_table DROP DISTRIBUTION KEYS;