This section explains how to load data from Amazon S3 into Exasol in parallel by using UDF scripting.
You can load data into Exasol from Amazon S3 in parallel by using a Lua script and a Python UDF (user-defined function). The script loads data from multiple files in S3 into one table in Exasol by establishing multiple parallel connections between S3 and Exasol.
For more information about UDFs, see Scripting.
For information on how to load data from Amazon S3 using the IMPORT command, see Load Data from Amazon S3 Using IMPORT.
This operation is only supported for files stored on Amazon S3 on AWS. Other cloud services that use an S3-compatible protocol are not supported.
You need an AWS account with a valid AWS access key ID and a corresponding AWS secret key, or you must work on an instance that does not require access keys.
Connect to the Exasol database using a database client.
Download the script file s3_to_exasol.sql from the Exasol GitHub repository.
Locate the following two statements at the end of the script file and modify them to match your configuration.
create or replace connection S3_IMPORT_BOTO
TO 'https://<my_bucketname>.s3.<my_region>.amazonaws.com'
USER '<my_access_key>'
IDENTIFIED BY '<my_secret_key>';
Placeholder | Description |
---|---|
<my_bucketname> | The name of the Amazon S3 bucket |
<my_region> | The AWS region, for example: eu-west-1 |
<my_access_key> | AWS access key ID |
<my_secret_key> | AWS secret key |
If the AWS bucket does not require authentication, you can omit the USER and IDENTIFIED BY clauses.
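For example, a connection to a public bucket that does not require authentication could look like the following sketch (the bucket name and region are placeholders, not values from your environment):

```sql
-- Hypothetical public bucket: the USER and IDENTIFIED BY clauses are omitted
create or replace connection S3_IMPORT_BOTO
TO 'https://my-public-bucket.s3.eu-west-1.amazonaws.com';
```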
execute script DATABASE_MIGRATION.s3_parallel_read(
true -- execute_statements
, false -- force_reload
, 'S3_LOADER_LOGGING' -- logging_schema
, '<schema>' -- schema_name
, '<table>' -- table_name
, NULL -- number_cols
, 'S3_IMPORT_BOTO' -- connection_name
, '<folder>/' -- folder_name
, '*.csv' -- filter_string
, 4 -- parallel_connections
, 'ENCODING=''UTF-8'' SKIP=0 ROW SEPARATOR = ''LF'' COLUMN SEPARATOR = '',''' -- file_opts
)
;
Parameter | Type | Description |
---|---|---|
execute_statements | boolean | true = the generated import statements are executed, false = the statements are only generated but not executed |
force_reload | boolean | true = target table and logging table will be truncated, all files in the bucket will be loaded again |
logging_schema | string | The schema to use for logging tables |
schema_name | string | The schema to load data into |
table_name | string | The table to load data into |
number_cols | integer | If set to a number, only that many columns of the file are imported. If you want to import all columns of the file, use NULL |
connection_name | string | The name of the connection created in the preceding statement (S3_IMPORT_BOTO in this example) |
folder_name | string | The name of the bucket folder that you want to import. To import everything from the bucket, use an empty string. NOTE: Regular expressions are not supported. |
filter_string | string | Filter for filenames, for example: *.csv. To import all files, use an empty string |
parallel_connections | integer | The number of parallel connections that you want to use |
file_opts | - | Additional file import options. For more details, see file_opts. |
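To illustrate how a glob-style filter_string selects files, the following Python sketch uses the standard fnmatch module. The filenames are made up, and the actual matching logic inside the script may differ; this only demonstrates the pattern semantics described above:

```python
from fnmatch import fnmatch

# Hypothetical object keys inside the bucket folder
filenames = ["data/2023_01.csv", "data/2023_02.csv", "data/readme.txt"]

def matches(name, filter_string):
    # An empty filter string imports every file
    return filter_string == "" or fnmatch(name, filter_string)

csv_files = [f for f in filenames if matches(f, "*.csv")]  # only CSV files
all_files = [f for f in filenames if matches(f, "")]       # everything
```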
Once you have updated the parameters as needed, execute all the statements in the file s3_to_exasol.sql.
The statements S3_GET_FILENAMES, GET_CONNECTION_NAME, and S3_PARALLEL_READ generate the UDFs and the script that are required for the import. The statement execute script DATABASE_MIGRATION.s3_parallel_read performs the actual import.
The first three statements are only required the first time that you perform the import on a database. After that, you only need to run the execute script statement to perform the import, since the UDFs and the scripts are already created.
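Conceptually, the parallel import divides the list of matching files among the configured number of parallel connections, with each group loaded over its own connection. A round-robin split can be sketched in Python (illustrative only; the script's actual distribution strategy may differ):

```python
def split_round_robin(files, parallel_connections):
    """Assign each file to one of n groups, round-robin."""
    groups = [[] for _ in range(parallel_connections)]
    for i, f in enumerate(files):
        groups[i % parallel_connections].append(f)
    return groups

files = [f"part_{i}.csv" for i in range(10)]  # made-up file names
groups = split_round_robin(files, 4)          # 4 parallel connections
```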
The execute script statement generates a table in the schema defined in logging_schema. This table has three columns: filename, last_modified, and status, and is used to keep track of the import status. Only files that have been modified or added since the last time the script was executed will be imported. If you want to import files that have already been imported, set the parameter force_reload to true.
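The incremental behavior can be pictured as comparing each file's last_modified timestamp in the bucket against the logging table. The following simplified Python sketch uses made-up filenames and timestamps; the real logging table lives in Exasol and also records a status column:

```python
# Hypothetical state of the logging table: filename -> last_modified
logged = {"a.csv": "2023-05-01T10:00:00", "b.csv": "2023-05-01T10:00:00"}

# Hypothetical current state of the bucket
bucket = {"a.csv": "2023-05-01T10:00:00",   # unchanged -> skipped
          "b.csv": "2023-06-01T09:30:00",   # modified  -> re-imported
          "c.csv": "2023-06-01T09:30:00"}   # new       -> imported

def files_to_import(bucket, logged, force_reload=False):
    if force_reload:
        return sorted(bucket)  # all files are loaded again
    # ISO-8601 timestamps compare correctly as strings
    return sorted(f for f, ts in bucket.items()
                  if f not in logged or ts > logged[f])

incremental = files_to_import(bucket, logged)
full = files_to_import(bucket, logged, force_reload=True)
```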