Load Data from Amazon S3 in Parallel Using UDF

This article explains how to load data from Amazon S3 into Exasol in parallel by using UDF scripting.

You can load data into Exasol from Amazon S3 in parallel by using a Lua script and a Python UDF (user-defined function). The script loads data from multiple files in S3 into one table in Exasol by establishing multiple connections between S3 and Exasol.

For more information about UDFs, see Scripting.

For information on how to load data from Amazon S3 using the IMPORT command, see Load Data from Amazon S3 Using IMPORT.

This operation is only supported for files stored on Amazon S3 on AWS. Other cloud services that use an S3 compatible protocol are not supported.

Prerequisites

You need an AWS account with a valid AWS access key ID and a corresponding AWS secret key, or you must be working on an instance that does not require access keys.

Procedure

  1. Connect to the Exasol database using a database client.

  2. Download the script file s3_to_exasol.sql from the Exasol GitHub repository.

  3. Locate the following two statements (at the end of the script file) and modify them to match your configuration.

    CREATE CONNECTION statement

    create or replace connection S3_IMPORT_BOTO
        TO 'https://<my_bucketname>.s3.<my_region>.amazonaws.com'
        USER '<my_access_key>'
        IDENTIFIED BY '<my_secret_key>';
    Placeholder       Description
    <my_bucketname>   The name of the Amazon S3 bucket
    <my_region>       The AWS region, for example: eu-west-1
    <my_access_key>   The AWS access key ID
    <my_secret_key>   The AWS secret key

    If the AWS bucket does not require authentication, you can omit the USER and IDENTIFIED BY clauses.

    EXECUTE SCRIPT statement

    execute script DATABASE_MIGRATION.s3_parallel_read(
      true                  -- execute_statements
    , false                 -- force_reload
    , 'S3_LOADER_LOGGING'   -- logging_schema
    , '<schema>'            -- schema_name
    , '<table>'             -- table_name
    , NULL                  -- number_cols
    , 'S3_IMPORT_BOTO'      -- connection_name
    , '<folder>/'           -- folder_name
    , '*.csv'               -- filter_string
    , 4                     -- parallel_connections
    , 'ENCODING=''UTF-8'' SKIP=0  ROW SEPARATOR = ''LF'' COLUMN SEPARATOR = '',''' -- file_opts
    )
    ;
    Parameter            Type     Description
    execute_statements   boolean  true = statements are executed immediately
                                  false = statements are only generated, not executed
    force_reload         boolean  true = the target table and the logging table are truncated, and all files in the bucket are loaded again
                                  false = only files that are new or have been modified since the last run are loaded
    logging_schema       string   The schema to use for the logging table
    schema_name          string   The schema to load data into
    table_name           string   The table to load data into
    number_cols          integer  If set to a number n, only the first n columns are imported. To import all columns of the file, use NULL.
    connection_name      string   The name of the connection created in the preceding statement (S3_IMPORT_BOTO unless you changed it)
    folder_name          string   The name of the bucket folder that you want to import. To import everything from the bucket, use an empty string (''). Regular expressions are not supported.
    filter_string        string   Filter for filenames, for example: '*_2018.csv'. To import all files, use an empty string ('').
    parallel_connections integer  The number of parallel connections that you want to use
    file_opts            string   Additional file import options. For more details, see file_opts.
  4. Once you have updated the parameters as needed, execute all the statements in the file s3_to_exasol.sql.
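The filter_string parameter uses shell-style wildcards rather than regular expressions. As a minimal sketch of how such a wildcard filter selects files within a folder, here is the equivalent logic in Python using fnmatch (the filenames and the exact matching behavior of the loader script are assumptions for illustration):

```python
from fnmatch import fnmatch

# Hypothetical object keys found in the bucket
keys = [
    "data/sales_2018.csv",
    "data/sales_2019.csv",
    "data/readme.txt",
]

folder_name = "data/"         # corresponds to the folder_name parameter
filter_string = "*_2018.csv"  # corresponds to the filter_string parameter

# Keep only keys inside the folder whose filename matches the wildcard filter
selected = [
    k for k in keys
    if k.startswith(folder_name) and fnmatch(k[len(folder_name):], filter_string)
]
print(selected)  # ['data/sales_2018.csv']
```

With filter_string set to '*.csv', both CSV files would be selected; an empty string would require an exact match, so use '*' semantics only through a non-empty pattern as shown in the parameter table.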

How this script works

The statements S3_GET_FILENAMES, GET_CONNECTION_NAME, and S3_PARALLEL_READ generate the UDFs and the script that are required for the import. The statement execute script DATABASE_MIGRATION.s3_parallel_read will perform the actual import.

The first three statements are only required the first time that you perform the import on a database. After that, you only need to run the execute script statement to perform the import, since the UDFs and the scripts are already created.

The execute script statement generates a table in the schema defined in logging_schema. This table has three columns (filename, last_modified, and status) and is used to keep track of the import status. Only files that have been modified or added since the last time the script was executed will be imported. If you want to import files that have already been imported, set the parameter force_reload to true.
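The incremental-load decision described above can be sketched as follows. This is a simplified Python illustration, assuming the logging table maps each filename to the last_modified timestamp recorded at import time (the timestamps and filenames are hypothetical):

```python
# Previously imported files, as recorded in the logging table
logged = {
    "data/sales_2018.csv": "2024-01-01T10:00:00Z",
    "data/sales_2019.csv": "2024-01-05T09:30:00Z",
}

# Hypothetical current bucket listing: filename -> last_modified timestamp
bucket = {
    "data/sales_2018.csv": "2024-01-01T10:00:00Z",  # unchanged -> skipped
    "data/sales_2019.csv": "2024-02-01T12:00:00Z",  # modified  -> reloaded
    "data/sales_2020.csv": "2024-02-02T08:15:00Z",  # new       -> loaded
}

force_reload = False

# A file is imported if force_reload is set, or if it is new or was
# modified since the timestamp stored in the logging table
to_import = [
    name for name, modified in bucket.items()
    if force_reload or logged.get(name) != modified
]
print(sorted(to_import))  # ['data/sales_2019.csv', 'data/sales_2020.csv']
```

Setting force_reload to True would select all three files, which mirrors the truncate-and-reload behavior of the script's force_reload parameter.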