Best Practices

This section describes best practices for loading data into Exasol.

Use ELT instead of ETL

ETL (Extract, Transform, Load) is one of the most commonly used methods for transferring data from a source system to a database. However, there are cases where you might want to use ELT instead.

An ETL process involves extracting the data from sources, transforming it, and then loading this data into the database.

An ELT (Extract, Load, Transform) process involves extracting data from sources, loading it into a staging area in the database, transforming the data using SQL (adjusting formats, checking primary and foreign keys, checking data quality, normalizing data), and then integrating the data into the target schema of the same Exasol database instance. The staging area is usually a schema within the database that buffers the data for the transformation.

The advantages of using a staging area for the ELT process are:

  • Since the transformation is done from within the database, the full cluster performance is utilized.
  • It provides fast and easy integration of data from the staging area into the target schema.
  • The data in the target schema is not contaminated with temporary staging data.
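
For illustration, the following is a minimal sketch of this flow, assuming a staging schema STG and a CSV source file (all names are placeholders):

-- Minimal ELT sketch; schema, table, and file names are illustrative.
-- The staging table is assumed to mirror the target columns plus an error_text column.
CREATE SCHEMA STG;
IMPORT INTO STG.SALES_POS_UPDATE
FROM LOCAL CSV FILE '/tmp/sales_pos_update.csv';
-- The data is then checked and transformed with SQL and integrated into
-- RETAIL.SALES_POSITIONS, as shown in the examples in the following sections.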

Data transformation

When transforming the source data within Exasol, consider a change in paradigm: instead of processing data row by row, process it as data sets, which is more efficient. When transforming data as data sets, the process is as follows:

  • Identify the possible error cases
  • Check all rows and set error codes
  • Integrate all correct rows in the target tables
  • Optional – keep invalid data in the staging area for further processing

Primary key constraint check

Since Exasol does not support cursors, execute all checks on the data within the staging area, and integrate only the transformed and checked data into the target schema. For example:

Initial state:

  • Fact table SALES_POSITIONS with PK (SALES_ID, POSITION_ID) and FK (SALES_ID)
  • Staging table SALES_POS_UPDATE, which includes only the relevant columns

The goal is that only the new sales positions are transferred from the staging table to the target schema, while the erroneous entries remain in the staging table with an appropriate error message.

The following is an example of the set-based approach described above:

-- Flag staging rows whose SALES_ID already exists in the target table
UPDATE STG.SALES_POS_UPDATE su
  SET error_text='SALES_ID already exists'
  WHERE EXISTS
    (
      SELECT 1 FROM RETAIL.SALES_POSITIONS s
      WHERE s.SALES_ID = su.SALES_ID
    )
;
-- Integrate only the rows without an error into the target table
INSERT INTO RETAIL.SALES_POSITIONS
  (SELECT SALES_ID, POSITION_ID, ARTICLE_ID, AMOUNT,
          PRICE, VOUCHER_ID, CANCELED
     FROM STG.SALES_POS_UPDATE su
     WHERE su.error_text IS NULL
  );
-- Remove the integrated rows, so only the erroneous entries remain in the staging table
DELETE FROM STG.SALES_POS_UPDATE
WHERE error_text IS NULL;

In many cases, the staging table contains not only new entries but also updates of existing rows. Sales positions that should be removed from the SALES_POSITIONS table might also be located in the staging table. For this purpose, you could omit the primary key check and replace the INSERT with an appropriate MERGE statement, as shown below:

MERGE INTO RETAIL.SALES_POSITIONS s
USING STG.SALES_POS_UPDATE su
   ON s.SALES_ID    = su.SALES_ID 
  AND s.POSITION_ID = su.POSITION_ID
WHEN MATCHED THEN UPDATE SET
      s.VOUCHER_ID = su.VOUCHER_ID
WHEN NOT MATCHED THEN INSERT VALUES
      (su.SALES_ID, su.POSITION_ID, su.ARTICLE_ID, su.AMOUNT,
       su.PRICE, su.VOUCHER_ID, su.CANCELED)
;

Avoid small data inserts

Avoid small data inserts, especially single-row IMPORT or INSERT statements. A single multi-row INSERT is faster than multiple single-row INSERT statements.

Use IMPORT instead of INSERT

Use the IMPORT statement over INSERT for better performance. If you are loading a small amount of data, such as 100 rows or less, using the INSERT statement might be faster. For larger amounts of data (10,000 rows or more), IMPORT is the fastest and the best way to load data.
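
For example, the following is a minimal sketch of loading a CSV file with IMPORT; the target table, file path, and format options are placeholders:

-- Illustrative CSV load; SKIP = 1 skips a header row
IMPORT INTO RETAIL.SALES_POSITIONS
FROM LOCAL CSV FILE '/tmp/sales_positions.csv'
COLUMN SEPARATOR = ','
SKIP = 1;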

Transactions

Avoid multiple INSERT statements into the same table, and instead use the bulk load option. Multiple INSERT statements executed simultaneously from different sessions on the same table will very likely lead to transaction conflicts. You may also receive the WAIT FOR COMMIT message and experience further delays.

Use bulk insert

To enhance the performance of data insertion, use the bulk insert option. To bulk insert data, group multiple rows into a single multi-row INSERT statement, and choose the bulk size as large as possible.
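
For example, a minimal sketch of a multi-row INSERT into the staging table; the column list and values are illustrative:

-- Illustrative values; in practice the VALUES list should contain as many rows as possible
INSERT INTO STG.SALES_POS_UPDATE
  (SALES_ID, POSITION_ID, ARTICLE_ID, AMOUNT, PRICE, VOUCHER_ID, CANCELED)
VALUES
  (1001, 1, 4711, 2, 19.99, NULL, FALSE),
  (1001, 2, 4712, 1, 4.50, NULL, FALSE),
  (1002, 1, 4711, 3, 19.99, NULL, FALSE);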

Data import and export

Instead of a separate bulk loading tool, Exasol provides the integrated SQL commands IMPORT and EXPORT to exchange data with many different types of systems. You can integrate data from flat files, databases, Hadoop, or any type of data source. You can also write your own connectors by using ETL UDFs.

Use IMPORT and EXPORT to transfer data between Exasol and:

  • Other Exasol databases
  • Other databases through a JDBC interface
  • Oracle databases, using Oracle Call Interface
  • Files (CSV, FBV)
  • UDF Scripts (including Hadoop)
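
For the outgoing direction, the following is a minimal sketch of exporting a table to a CSV file; the file name and options are placeholders:

-- Illustrative export to a local CSV file, including a header row with column names
EXPORT RETAIL.SALES_POSITIONS
INTO LOCAL CSV FILE '/tmp/sales_positions_export.csv'
COLUMN SEPARATOR = ','
WITH COLUMN NAMES;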

To achieve optimal parallelization: 

  • If you import from an Exasol database, importing is always parallelized. Loading tables directly is then significantly faster than using the STATEMENT option.

  • If you import data from Oracle sources, partitioned tables will be loaded in parallel.

  • If you import data from JDBC or Oracle sources, specifying multiple STATEMENT clauses will allow them to be executed in parallel, which may improve performance compared to loading all data as a single STATEMENT. You can only specify multiple statements for JDBC and Oracle sources, as shown in the example below.
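
The following is a minimal sketch of both cases; the connection objects EXA_SRC and JDBC_SRC are assumed to already exist (created beforehand with CREATE CONNECTION), and the table and column names are illustrative:

-- Import from another Exasol database; loading the table directly is parallelized
IMPORT INTO RETAIL.SALES_POSITIONS
FROM EXA AT EXA_SRC
TABLE RETAIL.SALES_POSITIONS;

-- Import from a JDBC source; each STATEMENT is executed and loaded in parallel
IMPORT INTO RETAIL.SALES_POSITIONS
FROM JDBC AT JDBC_SRC
STATEMENT 'SELECT * FROM SALES_POSITIONS WHERE MOD(SALES_ID, 2) = 0'
STATEMENT 'SELECT * FROM SALES_POSITIONS WHERE MOD(SALES_ID, 2) = 1';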