Best Practices

This section provides you with the ETL best practices for Exasol.

Switch from ETL to ELT

ETL (Extract, Transform, Load) is one of the most commonly used methods for transferring data from a source system to a database. However, there are cases where you might want to use ELT instead.

An ETL process involves extracting the data from sources, transforming it, and then loading this data into the database.

An ELT (Extract, Load, Transform) process involves extracting data from sources, loading it into a staging area in the database, and transforming it using SQL (adjusting formats, checking primary and foreign keys, checking data quality, normalizing data). The staging area is usually a schema within the database that buffers the data for the transformation. The transformed data is then integrated into the corresponding target schema of the same Exasol database instance.

The advantages of using a staging area for an ELT process are:

  • Since the transformation is done from within the database, the full cluster performance is utilized.
  • It provides fast and easy integration of data from the staging area into the target schema.
  • There is no contamination of data in the target schema with the temporary staging data.
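
The following is a minimal sketch of this flow. The schema, table, and column names as well as the file path are illustrative assumptions, not predefined Exasol objects:

-- Extract & Load: pull the raw file unchanged into a staging table.
IMPORT INTO STG.SALES_RAW
FROM LOCAL CSV FILE '/tmp/sales.csv'
COLUMN SEPARATOR = ';' SKIP = 1;

-- Transform & Integrate: adjust formats with plain SQL while moving the
-- data into the target schema, using the full cluster performance.
INSERT INTO RETAIL.SALES
SELECT CAST(sales_id AS DECIMAL(18,0)),
       TO_DATE(sales_date, 'YYYY-MM-DD'),
       TRIM(customer_name)
  FROM STG.SALES_RAW;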

Data Transformation

When transforming the source data within Exasol, consider a change in paradigm: instead of processing data row by row, process it as data sets. A set-based transformation proceeds as follows:

  • Identify the possible error cases
  • Check all rows and set error codes
  • Integrate all correct rows in the target tables
  • Optional – keep invalid data in the staging area for further processing (see the sketch after this list)
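
As a small sketch of the first two steps, a data-quality rule can be checked for all rows in a single set-based statement. It uses the SALES_POS_UPDATE staging table introduced in the next example; the ERROR_TEXT column and the AMOUNT rule are illustrative assumptions, and the primary key check below follows the same pattern:

-- Flag every row that violates a data-quality rule in one pass
-- (ERROR_TEXT column and the AMOUNT > 0 rule are illustrative assumptions).
UPDATE STG.SALES_POS_UPDATE
   SET error_text = 'AMOUNT must be positive'
 WHERE amount <= 0
   AND error_text IS NULL;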

Primary Key Constraint Check

Since Exasol does not support cursors, execute all checks on the data within the staging area, and integrate only the transformed and checked data into the target schema. For example:

Initial State:

  • Fact table SALES_POSITIONS – primary key (SALES_ID, POSITION_ID), foreign key (SALES_ID)
  • Staging table SALES_POS_UPDATE – contains only the relevant columns

The goal is to transfer only the new sales positions from the staging table to the target schema, leaving the erroneous entries in the staging table with an appropriate error message.

A set-based approach to this check looks as follows:

-- Mark rows whose SALES_ID already exists in the fact table
UPDATE STG.SALES_POS_UPDATE su
  SET error_text='SALES_ID already exists'
  WHERE EXISTS
    (
      SELECT 1 FROM RETAIL.SALES_POSITIONS s
      WHERE s.SALES_ID = su.SALES_ID
    )
;
-- Integrate only the rows that passed the check
INSERT INTO RETAIL.SALES_POSITIONS
  (SELECT SALES_ID, POSITION_ID, ARTICLE_ID, AMOUNT, 
          PRICE, VOUCHER_ID, CANCELED
     FROM STG.SALES_POS_UPDATE su
     WHERE su.ERROR_TEXT IS NULL
  );
-- Remove the integrated rows; only the erroneous entries stay in staging
DELETE FROM STG.SALES_POS_UPDATE 
WHERE error_text IS NULL;

In many cases, the staging table contains not only new entries but also updates of existing rows. Rows that should be removed from the SALES_POSITIONS table might be delivered through the staging table as well; a sketch for handling them follows the MERGE example below. To handle updates, you can omit the primary key check and replace the INSERT with an appropriate MERGE statement, as shown below:

-- Update existing positions and insert new ones in a single statement
MERGE INTO RETAIL.SALES_POSITIONS s
USING STG.SALES_POS_UPDATE su
   ON s.SALES_ID    = su.SALES_ID 
  AND s.POSITION_ID = su.POSITION_ID
WHEN MATCHED THEN UPDATE SET
      s.VOUCHER_ID = su.VOUCHER_ID
WHEN NOT MATCHED THEN INSERT VALUES
      (su.SALES_ID, su.POSITION_ID, su.ARTICLE_ID, su.AMOUNT,
       su.PRICE, su.VOUCHER_ID, su.CANCELED)
;
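
If the staging table also marks rows for removal, deletions can be handled in a set-based way too. This sketch assumes a hypothetical DELETE_FLAG column in SALES_POS_UPDATE that is not part of the tables above:

-- Remove positions that the staging table marks for deletion
-- (DELETE_FLAG is a hypothetical column used only for this sketch).
DELETE FROM RETAIL.SALES_POSITIONS s
 WHERE EXISTS
   (
     SELECT 1 FROM STG.SALES_POS_UPDATE su
     WHERE su.SALES_ID    = s.SALES_ID
       AND su.POSITION_ID = s.POSITION_ID
       AND su.DELETE_FLAG = TRUE
   );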

Avoid Small Data Inserts

As a best practice, avoid any small data inserts, especially single-row IMPORT or INSERT. Multi-row inserts are faster than multiple single-row INSERT statements.

IMPORT vs. INSERT

We recommend using the IMPORT statement over INSERT because it generally performs better. For small amounts of data (around 100 rows), an INSERT statement might be faster. However, for larger amounts of data (10,000 rows or more), IMPORT is the fastest way to load data.

Transactions

Multiple INSERT statements executed simultaneously from different sessions against the same tables are very likely to cause transaction conflicts, or you may receive the WAIT FOR COMMIT message and experience further delays. It is therefore best to avoid concurrent INSERT statements into the same table and to use the bulk load option instead.

Use Bulk Insert

To enhance the performance of data insertion, use the bulk insert option: group multiple rows into a single multi-row INSERT statement and choose the bulk size as large as possible.
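
A minimal sketch, reusing the staging table from the examples above (the values are illustrative):

-- One multi-row INSERT instead of three single-row statements
INSERT INTO STG.SALES_POS_UPDATE
  (SALES_ID, POSITION_ID, ARTICLE_ID, AMOUNT, PRICE, VOUCHER_ID, CANCELED)
VALUES
  (100, 1, 4711, 2, 19.99, NULL, FALSE),
  (100, 2, 4712, 1,  9.99, NULL, FALSE),
  (101, 1, 4711, 3, 19.99, NULL, FALSE);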

Data Import / Export

Instead of a separate bulk loading tool, Exasol provides the integrated SQL commands IMPORT and EXPORT to ingest and extract data from and to all kinds of systems. You can integrate data from flat files, databases, Hadoop, or any other type of data source. You can even write your own connectors using ETL UDFs.
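
For example, data from another database can be loaded through a JDBC connection. The connection name, URL, credentials, and source table below are illustrative assumptions (a suitable JDBC driver must be installed):

-- Define a reusable connection object (URL and credentials are placeholders).
CREATE CONNECTION my_jdbc_connection
  TO 'jdbc:postgresql://dbserver:5432/sourcedb'
  USER 'etl_user' IDENTIFIED BY 'secret';

-- Pull a source table into the staging schema through that connection.
IMPORT INTO STG.SALES_POS_UPDATE
FROM JDBC AT my_jdbc_connection
TABLE sales_pos_update;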

Use IMPORT / EXPORT to transfer data between Exasol and

  • Exasol databases
  • Oracle databases (using Oracle OCI)
  • Other databases via a JDBC interface
  • Files (CSV, FBV)
  • UDF Scripts (including Hadoop)

To achieve optimal parallelization:

  • Importing from Exasol databases is always parallelized. For Exasol, loading tables directly is significantly faster than using the STATEMENT option.

  • If you import data from Oracle sources, partitioned tables will be loaded in parallel.

  • Specifying multiple STATEMENT clauses allows them to be executed in parallel, which can improve performance compared to loading all data with a single STATEMENT (see the sketch below). You can only specify multiple statements for JDBC and Oracle sources.
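
A minimal sketch, assuming the JDBC connection from the previous example and an illustrative split condition on the source table:

-- Split one source table into two statements that are loaded in parallel
-- (connection name, source table, and split condition are illustrative).
IMPORT INTO STG.SALES_POS_UPDATE
FROM JDBC AT my_jdbc_connection
STATEMENT 'SELECT * FROM sales_pos_update WHERE MOD(sales_id, 2) = 0'
STATEMENT 'SELECT * FROM sales_pos_update WHERE MOD(sales_id, 2) = 1';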