Load data from Amazon S3 in parallel using UDF
This article explains how to load data from Amazon S3 into Exasol in parallel by using UDF scripting.
You can load data into Exasol from Amazon S3 in parallel by using a Lua script and a user defined function (UDF) written in Python. The UDF script enables you to load data from multiple files on S3 into a single table in Exasol by establishing multiple connections.
This operation is only supported for files stored on Amazon S3 on AWS. Other cloud services that use an S3 compatible protocol are not supported. Exasol automatically recognizes an Amazon S3 import based on the URL that is used with the IMPORT statement.
For more information about how to use UDFs, see User Defined Functions (UDFs).
For information on how to load data from Amazon S3 using the IMPORT command, see Load data from Amazon S3 using IMPORT.
Prerequisites
You need an AWS account with a valid AWS access key ID and the corresponding AWS secret key, or you must be working on an instance that does not require access keys.
Procedure
1. Connect to the Exasol database using a database client.
2. Download the script file s3_to_exasol.sql from the Exasol GitHub repository.
3. Locate the following two statements at the end of the script file and modify them to match your configuration. A filled-in example of both statements is shown after this procedure.
CREATE CONNECTION statement
create or replace connection S3_IMPORT_BOTO
TO 'https://<my_bucketname>.s3.<my_region>.amazonaws.com'
USER '<my_access_key>'
IDENTIFIED BY '<my_secret_key>';

Placeholder        Description
<my_bucketname>    The name of the Amazon S3 bucket
<my_region>        The AWS region, for example: eu-west-1
<my_access_key>    AWS access key ID
<my_secret_key>    AWS secret key

If the AWS bucket does not require authentication, you can omit the USER and IDENTIFIED BY clauses.
EXECUTE SCRIPT statement
execute script DATABASE_MIGRATION.s3_parallel_read(
true -- execute_statements
, false -- force_reload
, 'S3_LOADER_LOGGING' -- logging_schema
, '<schema>' -- schema_name
, '<table>' -- table_name
, NULL -- number_cols
, 'S3_IMPORT_BOTO' -- connection_name
, '<folder>/' -- folder_name
, '*.csv' -- filter_string
, 4 -- parallel_connections
, 'ENCODING=''UTF-8'' SKIP=0 ROW SEPARATOR = ''LF'' COLUMN SEPARATOR = '',''' -- file_opts
)
;

Parameter             Type      Description
execute_statements    boolean   true = statements are executed immediately; false = statements are only generated but not executed
force_reload          boolean   true = the target table and the logging table are truncated and all files in the bucket are loaded again
logging_schema        string    The schema to use for the logging tables
schema_name           string    The schema to load data into
table_name            string    The table to load data into
number_cols           integer   If set to a number n, the first n columns are imported. To import all columns of the file, use NULL.
connection_name       string    The name of the connection created in the preceding statement (S3_IMPORT_BOTO unless you changed it)
folder_name           string    The name of the bucket folder that you want to import. To import everything from the bucket, use an empty string (''). NOTE: Regular expressions are not supported.
filter_string         string    Filter for filenames, for example '*_2018.csv'. To import all files, use an empty string ('').
parallel_connections  integer   The number of parallel connections that you want to use
file_opts             -         Additional file import options. For more details, see file_opts.
4. Once you have updated the parameters as needed, execute all the statements in the file s3_to_exasol.sql.
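As an illustration, the two statements might look like the following once the placeholders are filled in. The bucket name, region, credentials, schema, table, and folder used here are made-up example values; replace them with your own.

create or replace connection S3_IMPORT_BOTO
TO 'https://my-example-bucket.s3.eu-west-1.amazonaws.com'
USER 'AKIAIOSFODNN7EXAMPLE'
IDENTIFIED BY 'wJalrXbCUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY';

execute script DATABASE_MIGRATION.s3_parallel_read(
true -- execute_statements: run the generated IMPORT statements immediately
, false -- force_reload: do not truncate the target and logging tables
, 'S3_LOADER_LOGGING' -- logging_schema
, 'RETAIL' -- schema_name (example schema)
, 'SALES' -- table_name (example table)
, NULL -- number_cols: import all columns
, 'S3_IMPORT_BOTO' -- connection_name
, 'daily_exports/' -- folder_name (example folder)
, '*.csv' -- filter_string
, 4 -- parallel_connections
, 'ENCODING=''UTF-8'' SKIP=0 ROW SEPARATOR = ''LF'' COLUMN SEPARATOR = '',''' -- file_opts
)
;

Setting execute_statements to false instead would only generate the IMPORT statements without running them, which can be useful if you want to review them first.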
How this script works
The S3_GET_FILENAMES, GET_CONNECTION_NAME, and S3_PARALLEL_READ statements create the UDFs and the script that are required for the import. The statement execute script DATABASE_MIGRATION.s3_parallel_read performs the actual import.
The first three statements are only required the first time that you perform the import on a database. After that, you only need to run the execute script statement to perform the import, since the UDFs and the scripts are already created.
The execute script statement generates a table in the schema defined in logging_schema. This table has three columns: filename, last_modified, and status, and is used to keep track of the import status. Only files that have been modified or added since the last time the script was executed will be imported. If you want to re-import files that have already been imported, set the parameter force_reload to true.
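If you want to check which files have been imported and when, you can query the logging table, for example as sketched below. The logging table itself is created by the script, and its name depends on your target schema and table, so the name used here is only a placeholder.

SELECT filename, last_modified, status
FROM S3_LOADER_LOGGING."<logging_table>"
ORDER BY last_modified DESC;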