Load data from Amazon S3 in parallel using UDF
This article explains how to load data from Amazon S3 into Exasol in parallel by using UDF scripting.
You can load data into Exasol from Amazon S3 in parallel by using a Lua script and a user defined function (UDF) written in Python. The UDF script enables you to load data from multiple files on S3 into a single table in Exasol by establishing multiple connections.
This operation is only supported for files stored on Amazon S3 on AWS. Other cloud services that use an S3 compatible protocol are not supported. Exasol automatically recognizes an Amazon S3 import based on the URL that is used with the IMPORT statement.
For more information about how to use UDFs, see User Defined Functions (UDFs).
For information on how to load data from Amazon S3 using the IMPORT command, see Load data from Amazon S3 using IMPORT.
Prerequisites
You need an AWS account with a valid AWS access key ID and the corresponding AWS secret key, or you must be working on an instance that does not require access keys.
Procedure
1. Connect to the Exasol database using a database client.
2. Download the script file s3_to_exasol.sql from the Exasol GitHub repository.
3. Locate the following two statements at the end of the script file and modify them to match your configuration. A filled-in example of both statements is shown after this procedure.
CREATE CONNECTION statement
create or replace connection S3_IMPORT_BOTO
TO 'https://<my_bucketname>.s3.<my_region>.amazonaws.com'
USER '<my_access_key>'
IDENTIFIED BY '<my_secret_key>';

Placeholder        Description
<my_bucketname>    The name of the Amazon S3 bucket
<my_region>        The AWS region, for example: eu-west-1
<my_access_key>    AWS access key ID
<my_secret_key>    AWS secret key

If the AWS bucket does not require authentication, you can omit the USER and IDENTIFIED BY clauses.
EXECUTE SCRIPT statement
execute script DATABASE_MIGRATION.s3_parallel_read(
true -- execute_statements
, false -- force_reload
, 'S3_LOADER_LOGGING' -- logging_schema
, '<schema>' -- schema_name
, '<table>' -- table_name
, NULL -- number_cols
, 'S3_IMPORT_BOTO' -- connection_name
, '<folder>/' -- folder_name
, '*.csv' -- filter_string
, 4 -- parallel_connections
, 'ENCODING=''UTF-8'' SKIP=0 ROW SEPARATOR = ''LF'' COLUMN SEPARATOR = '',''' -- file_opts
)
;

Parameter             Type      Description
execute_statements    boolean   true = statements are executed immediately; false = statements are only generated but not executed
force_reload          boolean   true = the target table and the logging table are truncated and all files in the bucket are loaded again
logging_schema        string    The schema to use for the logging tables
schema_name           string    The schema to load data into
table_name            string    The table to load data into
number_cols           integer   If set to a number n, the first n columns are imported. To import all columns of the file, use NULL.
connection_name       string    The name of the connection created in the preceding statement (S3_IMPORT_BOTO unless you changed it)
folder_name           string    The name of the bucket folder that you want to import. To import everything from the bucket, use an empty string (''). NOTE: Regular expressions are not supported.
filter_string         string    Filter for filenames, for example '*_2018.csv'. To import all files, use an empty string ('').
parallel_connections  integer   The number of parallel connections that you want to use
file_opts             -         Additional file import options. For more details, see file_opts.
4. Once you have updated the parameters as needed, execute all the statements in the file s3_to_exasol.sql.
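As an illustration, the two statements might look like the following once the placeholders are filled in. The bucket name, region, credentials, schema, table, and folder used here are made-up example values; replace them with your own.

create or replace connection S3_IMPORT_BOTO
TO 'https://my-example-bucket.s3.eu-west-1.amazonaws.com'
USER 'AKIAIOSFODNN7EXAMPLE'
IDENTIFIED BY 'wJalrXbCUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY';

execute script DATABASE_MIGRATION.s3_parallel_read(
true -- execute_statements: run the generated IMPORT statements immediately
, false -- force_reload: do not truncate the target and logging tables
, 'S3_LOADER_LOGGING' -- logging_schema
, 'RETAIL' -- schema_name (example schema)
, 'SALES' -- table_name (example table)
, NULL -- number_cols: import all columns
, 'S3_IMPORT_BOTO' -- connection_name
, 'daily_exports/' -- folder_name (example folder)
, '*.csv' -- filter_string
, 4 -- parallel_connections
, 'ENCODING=''UTF-8'' SKIP=0 ROW SEPARATOR = ''LF'' COLUMN SEPARATOR = '',''' -- file_opts
)
;

Setting execute_statements to false instead would only generate the IMPORT statements without running them, which can be useful if you want to review them first.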
How this script works
The S3_GET_FILENAMES, GET_CONNECTION_NAME, and S3_PARALLEL_READ statements create the UDFs and the script that are required for the import. The statement execute script DATABASE_MIGRATION.s3_parallel_read performs the actual import.
The first three statements are only required the first time that you perform the import on a database. After that, you only need to run the execute script statement to perform the import, since the UDFs and the scripts are already created.
The execute script statement generates a table in the schema defined in logging_schema. This table has three columns: filename, last_modified, and status, and is used to keep track of the import status. Only files that have been modified or added since the last time the script was executed will be imported. If you want to re-import files that have already been imported, set the parameter force_reload to true.
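If you want to check which files have been imported and when, you can query the logging table, for example as sketched below. The logging table itself is created by the script, and its name depends on your target schema and table, so the name used here is only a placeholder.

SELECT filename, last_modified, status
FROM S3_LOADER_LOGGING."<logging_table>"
ORDER BY last_modified DESC;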