Best Practices
This article describes best practices for loading data into Exasol.
Use ELT instead of ETL
ETL (Extract, Transform, Load) is one of the most commonly used methods for transferring data from a source system to a database. However, there are cases where you might want to use ELT instead.
An ETL process involves extracting the data from sources, transforming it, and then loading this data into the database.
An ELT (Extract, Load, Transform) process involves extracting data from sources, loading it into a staging area in the database, transforming the data using SQL (adjusting formats, checking primary and foreign keys, checking data quality, normalizing data), and then integrating the data into the target schema. The staging area is usually a schema within the database that buffers the data for the transformation. The transformed data is then integrated into the corresponding target schema of the same Exasol database instance.
The advantages of using a staging area for an ELT process are:
- Since the transformation is done from within the database, the full cluster performance is utilized.
- It provides fast and easy integration of data from the staging area into the target schema.
- There is no contamination of data in the target schema with the temporary staging data.
Data transformation
When transforming the source data within Exasol, consider a change in paradigm: processing data as data sets is more efficient than processing it row by row. When transforming data as data sets, the process is as follows:
- Identify the possible error cases
- Check all rows and set error codes
- Integrate all correct rows in the target tables
- Optional: keep invalid data in the staging area for further processing
Primary key constraint check
Since Exasol does not support cursors, execute all checks on the data within the staging area, and integrate only the transformed and checked data into the target schema. For example:
Initial State:
- Fact table SALES_POSITIONS with primary key (SALES_ID, POSITION_ID) and foreign key (SALES_ID)
- Staging table SALES_POS_UPDATE, which includes only the relevant columns
The goal is to transfer only the new sales positions from the staging table to the target schema, leaving the erroneous entries in the staging table with an appropriate error message. A set-based approach to this example looks as follows:
-- Flag staging rows whose SALES_ID already exists in the target fact table
UPDATE STG.SALES_POS_UPDATE su
SET error_text='SALES_ID already exists'
WHERE EXISTS
(
SELECT 1 FROM RETAIL.SALES_POSITIONS s
WHERE s.SALES_ID = su.SALES_ID
)
;
-- Integrate only the rows that passed the check into the target table
INSERT INTO RETAIL.SALES_POSITIONS
(SELECT SALES_ID, POSITION_ID, ARTICLE_ID, AMOUNT,
PRICE, VOUCHER_ID, CANCELED
FROM STG.SALES_POS_UPDATE su
WHERE su.ERROR_TEXT IS NULL
);
-- Remove the processed rows from the staging table, keeping only the erroneous entries
DELETE FROM STG.SALES_POS_UPDATE
WHERE error_text IS NULL;
In many cases, the staging table contains not only new entries but also updates of existing rows. Sales positions that should be removed from the SALES_POSITIONS table might be located in the staging table as well. In this case, you can omit the primary key check and replace the INSERT with an appropriate MERGE statement, as sketched below:
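A minimal sketch of such a MERGE, assuming the staging table carries the same columns as the target table plus a hypothetical TO_DELETE flag that marks positions to be removed:
-- Update or insert staging rows; remove positions flagged for deletion
MERGE INTO RETAIL.SALES_POSITIONS s
USING STG.SALES_POS_UPDATE su
ON (s.SALES_ID = su.SALES_ID AND s.POSITION_ID = su.POSITION_ID)
WHEN MATCHED THEN UPDATE SET
ARTICLE_ID = su.ARTICLE_ID,
AMOUNT = su.AMOUNT,
PRICE = su.PRICE,
VOUCHER_ID = su.VOUCHER_ID,
CANCELED = su.CANCELED
-- TO_DELETE is a hypothetical flag column in the staging table
DELETE WHERE su.TO_DELETE = TRUE
WHEN NOT MATCHED THEN INSERT VALUES
(su.SALES_ID, su.POSITION_ID, su.ARTICLE_ID, su.AMOUNT,
su.PRICE, su.VOUCHER_ID, su.CANCELED);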
Avoid small data inserts
Avoid any small data inserts, especially single-row IMPORT or INSERT statements. Multi-row inserts are faster than multiple single-row INSERT statements.
Use IMPORT instead of INSERT
Use the IMPORT statement instead of INSERT for better performance. If you are loading a small amount of data (around 100 rows or less), an INSERT statement might be faster. For larger amounts of data (10,000 rows or more), IMPORT is the fastest way to load data.
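As a minimal sketch, a bulk load from a local CSV file might look like the following; the file path and the COLUMN SEPARATOR and SKIP options are illustrative and depend on your file:
IMPORT INTO RETAIL.SALES_POSITIONS
FROM LOCAL CSV FILE '/tmp/sales_positions.csv'
COLUMN SEPARATOR = ','
SKIP = 1;  -- skip the header row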
Transactions
Avoid multiple INSERT statements into the same table and use the bulk load option instead. Multiple INSERT statements executed simultaneously from different sessions on the same table will very likely lead to transaction conflicts. You may also receive the WAIT FOR COMMIT message and experience further delays.
Use bulk insert
To enhance the performance of data insertion, use the bulk insert option. To bulk insert data, group multiple rows in a single multi-row INSERT statement and choose the bulk size as large as possible.
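A minimal sketch of a multi-row INSERT, using the columns from the earlier example; the values are illustrative:
INSERT INTO RETAIL.SALES_POSITIONS
(SALES_ID, POSITION_ID, ARTICLE_ID, AMOUNT, PRICE, VOUCHER_ID, CANCELED)
VALUES
(1001, 1, 4711, 2, 19.99, NULL, FALSE),
(1001, 2, 4712, 1, 5.49, NULL, FALSE),
(1002, 1, 4713, 3, 7.00, NULL, FALSE);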
Data import and export
Instead of a separate bulk loading tool, Exasol provides the integrated SQL commands IMPORT and EXPORT to exchange data with many different types of systems. You can integrate data from flat files, databases, Hadoop, or any type of data source. You can also write your own connectors by using ETL UDFs.
Use IMPORT and EXPORT to transfer data between Exasol and:
- Other Exasol databases
- Other databases through a JDBC interface
- Oracle databases, using Oracle Call Interface
- Files (CSV, FBV)
- UDF Scripts (including Hadoop)
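For example, a direct table-to-table load from another Exasol database could look like this sketch, where my_exa_connection and the source schema name are placeholders:
IMPORT INTO RETAIL.SALES_POSITIONS
FROM EXA AT my_exa_connection
TABLE SOURCE_SCHEMA.SALES_POSITIONS;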
To achieve optimal parallelization:
- If you import from an Exasol database, importing is always parallelized. Loading tables directly is significantly faster than using the STATEMENT option.
- If you import data from Oracle sources, partitioned tables are loaded in parallel.
- If you import data from JDBC or Oracle sources, specifying multiple STATEMENT clauses allows them to be executed in parallel, which may improve performance compared to loading all data with a single STATEMENT (see the sketch after this list). You can only specify multiple statements for JDBC and Oracle sources.
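A sketch of a parallelized JDBC import with multiple STATEMENT clauses; the connection name, the source table, and the region column used to split the data are placeholders:
-- Each STATEMENT is executed in parallel against the source system
IMPORT INTO RETAIL.SALES_POSITIONS
FROM JDBC AT my_jdbc_connection
STATEMENT 'SELECT * FROM sales_positions WHERE region = ''NORTH'''
STATEMENT 'SELECT * FROM sales_positions WHERE region = ''SOUTH''';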