Load Data from Amazon S3 in Parallel Using UDFs

You can load data from Amazon S3 on AWS into Exasol in parallel using a Lua script and UDFs. The script loads data from multiple files in S3 into a single table in Exasol by establishing multiple connections between S3 and Exasol.

Only Amazon S3 on AWS is supported. Other services that use an S3 compatible protocol are not supported.

Prerequisites

You need an AWS account with a valid AWS access key ID (AWS-AccessKeyId) and the corresponding secret access key (AWS-SecretAccessKey), or you need to work on an instance that does not require access keys.

Procedure

To start a parallel import, follow the steps below:

  1. From your preferred SQL client, connect to the Exasol database. Next, create a connection to S3 by modifying the connection name, URL, and access keys in the following statement and executing it (a sketch for verifying the connection follows this procedure):
    CREATE CONNECTION S3_MY_BUCKETNAME
        TO 'http://<my_bucketname>.s3.<my_region>.amazonaws.com' -- my_region could e.g. be eu-west-1
        USER '<my_access_key>'
        IDENTIFIED BY '<my_secret_key>';
  2. Download the script file s3_to_exasol.sql from the GitHub repository. The file contains the following three scripts that you need to import the data:
    • S3_GET_FILENAMES, a Python UDF
    • GET_CONNECTION_NAME, a Lua UDF
    • S3_PARALLEL_READ, a Lua script
  3. Next, look for execute script DATABASE_MIGRATION.s3_parallel_read at the bottom of the file and modify its parameters to match your configuration (a sketch of the kind of IMPORT statement it generates follows this procedure). For example:
    execute script DATABASE_MIGRATION.s3_parallel_read(
    true                          -- if true, statements are executed immediately; if false, statements are only generated
    ,false                        -- force reload: if true, the target table and the logging table are truncated and all files in the bucket are loaded again
    ,'S3_IMPORT_LOGGING'          -- schema you want to use for the logging tables
    ,'PRODUCT'                    -- name of the schema that holds the table you want to import into
    ,'test'                       -- name of the table you want to import into
    ,'S3_MY_BUCKETNAME'           -- connection name (see the statement in step 1)
    ,'my_project/'                -- folder name; leave blank to import everything
    ,''                           -- filter for file names; use an empty string to include all files, or '*.csv' to load only CSV files
    ,4                            -- number of parallel connections you want to use
    ,'ENCODING=''ASCII'' SKIP=1 ROW SEPARATOR = ''CRLF''' -- file options; see the IMPORT section of the manual for more information
    );
  4. Execute all the statements in the file (s3_to_exasol.sql). The first three statements create the UDFs and the script needed for the import (S3_GET_FILENAMES, GET_CONNECTION_NAME, and S3_PARALLEL_READ), and the execute script DATABASE_MIGRATION.s3_parallel_read statement starts the import.
  5. When you perform the import for the first time, you need to execute all the statements in the file. From then on, you only need to execute the execute script DATABASE_MIGRATION.s3_parallel_read statement, because the UDFs and the script have already been created.

  6. Executing the script creates a table in the schema S3_IMPORT_LOGGING. This table has three columns: filename, last_modified, and status.
    The table keeps track of the import status: only files that have been modified or added since the last time the script was executed are imported. If you want to re-import files that have already been imported, set the force reload parameter to true. A sketch for querying the logging table follows this procedure.
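
After step 1, you can confirm that the connection object exists before running the import. This is a minimal sketch, assuming you have the privilege to read the system table EXA_DBA_CONNECTIONS and used the connection name from the example above:

    -- Check that the S3 connection created in step 1 is present
    SELECT CONNECTION_NAME, CONNECTION_STRING
    FROM SYS.EXA_DBA_CONNECTIONS
    WHERE CONNECTION_NAME = 'S3_MY_BUCKETNAME';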
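
The first parameter of DATABASE_MIGRATION.s3_parallel_read controls whether the generated statements are executed immediately or only generated, so you can review them first. The statements are based on the regular Exasol IMPORT command; the following rough sketch shows what such a statement could look like for the example configuration in step 3 (the file names are hypothetical):

    -- Hypothetical example of a generated IMPORT statement; actual file names come from the bucket
    IMPORT INTO PRODUCT.test
    FROM CSV AT S3_MY_BUCKETNAME
    FILE 'my_project/part_01.csv'
    FILE 'my_project/part_02.csv'
    ENCODING='ASCII' SKIP=1 ROW SEPARATOR = 'CRLF';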
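
To review which files have been loaded and with what status, you can query the logging table that the script creates in the S3_IMPORT_LOGGING schema. This is a minimal sketch; the exact name of the logging table is not fixed here, so look it up first and substitute it in the second query:

    -- List the logging tables the script has created
    SELECT TABLE_NAME
    FROM SYS.EXA_ALL_TABLES
    WHERE TABLE_SCHEMA = 'S3_IMPORT_LOGGING';

    -- Inspect the import status per file (replace <logging_table> with a name returned above)
    -- SELECT filename, last_modified, status FROM S3_IMPORT_LOGGING."<logging_table>";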

To learn how to use the IMPORT command to load data from Amazon S3 buckets, see Load Data from Amazon S3 Using IMPORT.