Load Data from Amazon S3 in Parallel Using UDF
This article explains how to load data from Amazon S3 into Exasol in parallel by using UDF scripting.
You can load data into Exasol from Amazon S3 in parallel by using a Lua script and a Python UDF (user-defined function). The script loads data from multiple files in S3 into one table in Exasol by establishing multiple connections between S3 and Exasol.
For more information about UDFs, see Scripting.
For information on how to load data from Amazon S3 using the IMPORT command, see Load Data from Amazon S3 Using IMPORT.
This operation is only supported for files stored on Amazon S3 on AWS. Other cloud services that use an S3-compatible protocol are not supported.
Prerequisites
You need an AWS account with a valid AWS access key ID and a corresponding AWS secret key, or you must work on an instance that does not require access keys.
Procedure
-
Connect to the Exasol database using a database client.
-
Download the script file s3_to_exasol.sql from the Exasol GitHub repository.
-
Locate the following two statements at the end of the script file and modify them to match your configuration.
CREATE CONNECTION statement
create or replace connection S3_IMPORT_BOTO
TO 'https://<my_bucketname>.s3.<my_region>.amazonaws.com'
USER '<my_access_key>'
IDENTIFIED BY '<my_secret_key>';

Placeholders:
- <my_bucketname>: The name of the Amazon S3 bucket
- <my_region>: The AWS region, for example: eu-west-1
- <my_access_key>: The AWS access key ID
- <my_secret_key>: The AWS secret key

If the AWS bucket does not require authentication, you can omit the USER and IDENTIFIED BY clauses.

EXECUTE SCRIPT statement
execute script DATABASE_MIGRATION.s3_parallel_read(
true -- execute_statements
, false -- force_reload
, 'S3_LOADER_LOGGING' -- logging_schema
, '<schema>' -- schema_name
, '<table>' -- table_name
, NULL -- number_cols
, 'S3_IMPORT_BOTO' -- connection_name
, '<folder>/' -- folder_name
, '*.csv' -- filter_string
, 4 -- parallel_connections
, 'ENCODING=''UTF-8'' SKIP=0 ROW SEPARATOR = ''LF'' COLUMN SEPARATOR = '',''' -- file_opts
)
;

Parameters:
- execute_statements (boolean): true = statements are executed immediately; false = statements are only generated but not executed
- force_reload (boolean): true = the target table and logging table are truncated and all files in the bucket are loaded again
- logging_schema (string): The schema to use for logging tables
- schema_name (string): The schema to load data into
- table_name (string): The table to load data into
- number_cols (integer): If set to a number n, the first n columns are imported. To import all columns of the file, use NULL
- connection_name (string): The name of the connection created in the preceding statement (S3_IMPORT_BOTO unless you changed it)
- folder_name (string): The name of the bucket folder that you want to import. To import everything from the bucket, use an empty string (''). NOTE: Regular expressions are not supported.
- filter_string (string): Filter for filenames, for example '*_2018.csv'. To import all files, use an empty string ('')
- parallel_connections (integer): The number of parallel connections that you want to use
- file_opts: Additional file import options. For more details, see file_opts.
-
Once you have updated the parameters as needed, execute all the statements in the file s3_to_exasol.sql.
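The folder_name and filter_string parameters together select which S3 objects are loaded: folder_name narrows the listing to a prefix, and filter_string applies a shell-style filename pattern. The actual matching happens inside the UDF; as an illustration only (the object keys and helper below are hypothetical, not the script's code), such a pattern behaves like Python's fnmatch:

```python
from fnmatch import fnmatch

# Hypothetical object keys in the bucket (not from the article).
keys = [
    "exports/orders_2018.csv",
    "exports/orders_2019.csv",
    "exports/readme.txt",
]

folder = "exports/"      # folder_name
pattern = "*_2018.csv"   # filter_string

# Keep only keys under the folder whose remaining name matches the pattern.
matched = [k for k in keys
           if k.startswith(folder) and fnmatch(k[len(folder):], pattern)]
print(matched)  # ['exports/orders_2018.csv']
```

With filter_string set to '*.csv', both CSV files would match; with an empty string for both parameters, every object in the bucket is selected.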
How this script works
The statements S3_GET_FILENAMES, GET_CONNECTION_NAME, and S3_PARALLEL_READ generate the UDFs and the script that are required for the import. The statement execute script DATABASE_MIGRATION.s3_parallel_read performs the actual import.
The first three statements are only required the first time that you perform the import on a database. After that, you only need to run the execute script statement to perform the import, since the UDFs and the scripts are already created.
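The parallelism comes from splitting the list of matched files across the configured number of parallel connections, so that several files are read at the same time. How the script groups the files is internal to the UDFs; a simplified round-robin sketch (hypothetical helper, not the script's actual code) illustrates the idea:

```python
def split_round_robin(files, parallel_connections):
    """Distribute files across one group per parallel connection."""
    groups = [[] for _ in range(parallel_connections)]
    for i, name in enumerate(files):
        groups[i % parallel_connections].append(name)
    return groups

# Six files spread over 4 connections: two connections get two files each.
files = ["data_%d.csv" % i for i in range(6)]
groups = split_round_robin(files, 4)
print(groups)
```

Increasing parallel_connections beyond the number of files gains nothing, since some groups would simply stay empty.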
The execute script statement generates a table in the schema defined in logging_schema. This table has three columns: filename, last_modified, and status, and is used to keep track of the import status. Only files that have been modified or added since the last time the script was executed are imported. If you want to re-import files that have already been imported, set the parameter force_reload to true.
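The incremental behavior can be pictured as comparing each file's last_modified timestamp in S3 against the entry recorded in the logging table. The sketch below is a hypothetical illustration of that decision logic, not the script's actual bookkeeping:

```python
def files_to_import(s3_files, logged, force_reload=False):
    """Return the filenames that need (re)importing.

    s3_files and logged both map filename -> last_modified timestamp;
    logged represents what the logging table recorded at the last run.
    """
    if force_reload:
        # Everything is loaded again, as if the logging table were truncated.
        return sorted(s3_files)
    return sorted(name for name, modified in s3_files.items()
                  if name not in logged or modified > logged[name])

s3 = {"a.csv": 100, "b.csv": 200}   # current bucket contents
log = {"a.csv": 100}                # a.csv was imported and is unchanged
print(files_to_import(s3, log))                     # ['b.csv']
print(files_to_import(s3, log, force_reload=True))  # ['a.csv', 'b.csv']
```

Only b.csv is new, so only it is selected; with force_reload set, both files are selected again.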