This section provides you with the ETL best practices for Exasol.
Switch from ETL to ELT
ETL (Extract, Transform, Load) is one of the most commonly used methods for transferring data from a source system to a database. However, there are cases where you might want to use ELT instead.
An ETL process involves extracting the data from sources, transforming it, and then loading this data into the database.
An ELT (Extract, Load, Transform) process involves extracting data from sources, loading it into a staging area in the database, transforming the data using SQL (adjust formats, check primary keys, check foreign keys, check data quality, normalize data), and then integrating data into the target schema. The staging area here is usually a schema within the database which buffers the data for the transformation. The transformed data is then integrated into the corresponding target schema of the same Exasol database instance.
Advantages of using a staging area for ELT process:
- Since the transformation is done from within the database, the full cluster performance is utilized.
- It provides fast and easy integration of data from the staging area into the target schema.
- There is no contamination of data in the target schema with the temporary staging data.
When transforming the source data within Exasol, consider a change in paradigm. Instead of processing data row-based, processing data as data sets are more appropriate. When transforming data as data sets, the process is as follows:
- Identify the possible error cases
- Check all rows and set error codes
- Integrate all correct rows in the target tables
- Optional – keep invalid data on the staging area for further processing
Primary Key Constraint Check
Since Exasol supports no cursors, execute all checks on the data within the staging area, and integrate only the transformed and checked data into the target schema. For example:
- Fact table SALES_POSITIONS - PK (SALES_ID, POSITION_ID) - FK (SALES_ID)
– Staging table SALES_POS_UPDATE – This includes only the relevant columns.
Therefore, only the new sales position is transferred from the staging table to target schema, leaving the erroneous entries in the staging table with an appropriate error message.
As an example for set-based approach to the above:
UPDATE STG.SALES_POS_UPDATE su SET error_text='SALES_ID already exists' WHERE EXISTS ( SELECT 1 FROM RETAIL.SALES_POSITIONS s WHERE s.SALES_ID = su.SALES_ID ) ; INSERT INTO RETAIL.SALES_POSITIONS (SELECT SALES_ID, POSITION_ID, ARTICLE_ID, AMOUNT, PRICE, VOUCHER_ID, CANCELED FROM STG.SALES_POS_UPDATE su WHERE su.ERROR_TEXT IS NULL ); DELETE FROM STG.SALES_POS_UPDATE WHERE error_text IS NULL;
In many cases, the staging table not only contains new entries but also updates of existing rows. The sales positions that should be removed from the SALES_POSITIONS table might be located in the staging table too. For this purpose, you could omit the primary key check and replace the INSERT by an appropriate MERGE statement as shown below:
MERGE INTO RETAIL.SALES_POSITIONS s USING STG.SALES_POS_UPDATE su ON s.SALES_ID = su.SALES_ID AND s.POSITION_ID = su.POSITION_ID WHEN MATCHED THEN UPDATE SET s.VOUCHER_ID = su.VOUCHER_ID WHEN NOT MATCHED THEN INSERT VALUES (SALES_ID, POSITION_ID, ARTICLE_ID, AMOUNT, PRICE, VOUCHER_ID, CANCELED) ;
Avoid Small Data Inserts
IMPORT vs. INSERT
We recommend you use the IMPORT statement over INSERT as it performs better than the INSERT statement. If you are loading smaller data (say about 100 rows), using INSERT statement might be faster. However, for larger amounts of data (10,000 plus rows), IMPORT is the fastest and the best way to load data.
Multiple INSERT statements from different sessions done simultaneously will very likely lead to transaction conflicts if done towards the same tables or you would receive the WAIT FOR COMMIT message and may experience further delays. Hence, it is best to avoid multiple INSERTS statements into the same table and instead use the bulk load option.
Use Bulk Insert
To enhance the performance of data insertion, use bulk insert option. To bulk-insert data, group multiple rows in a single multi-row INSERT statement. Choose the bulk size as large as possible.
Data Import / Export
Instead of a separate bulk loading tool, Exasol provides integrated SQL commands IMPORT and EXPORT to ingest and extract data from / to all kinds of systems. You can integrate data from flat files, databases, Hadoop, or any type of data source. You can even write your own connectors by using ETL UDFs.
Use IMPORT / EXPORT to transfer data between Exasol and
- Exasol databases
- Oracle databases (using Oracle OCI)
- Generic databases (using a JDBC driver)
- Files (CSV, FBV)
- UDF Scripts (including Hadoop)
To achieve an optimal parallelization:
- Importing from Exasol databases is always parallelized. For Exasol, loading tables directly is significantly faster than using the STATEMENT option.
- If you import data from Oracle sources, partitioned tables will be loaded in parallel.
- Specifying multiple statements is only possible for JDBC and Oracle sources.