Load Data from Amazon S3 in Parallel Using UDFs
You can load data from Amazon S3 into Exasol in parallel using a Lua script and a Python UDF. This script allows you to load data from multiple files in S3 into one table in Exasol by establishing multiple connections between S3 and Exasol.
If you are using a version of Exasol prior to Exasol 6.0.10, you need to first add Boto library to your Exasol system. Boto library is a Python interface for Amazon Web Services. It allows you to list folders and the contents within it in S3. Additionally, Boto needs to be available on every node in the cluster to access it in the Python UDF. This is achieved by uploading the Boto package to BucketFS - Exasol's replicated file system. For more information, see Loading Data from Amazon S3 in Parallel for Exasol 6.1.
You need an AWS account with a valid AWS-AccessKeyId and a corresponding AWS-SecretAccessKey or you need to work on an instance that does not need access keys.
To start a parallel import, follow the steps below:
- From your preferred SQL client connect to the Exasol database. Next, create a connection to S3 by modifying Connection name, URL, and access keys in the following statement and executing it:
- Download the python script file s3_to_Exasol.sql from the GitHub repository. The python script file, in turn, contains the following three script that you will need to import data:
- S3_GET_FILENAMES, a Python UDF
- GET_CONNECTION_NAME, a Lua UDF
- S3_PARALLEL_READ, a Lua script
- Next, look for execute script DATABASE_MIGRATION.s3_parallel_read at the bottom of the file and modify it to adapt to your configurations. For example:
- Execute all the statements in the file (s3_to_exasol.sql). The first three statements (S3_GET_FILENAMES, GET_CONNECTION_NAME, S3_PARALLEL_READ) within the file will generate the UDFs and the script necessary for import, and the execute script DATABASE_MIGRATION.s3_parallel_read will start the import.
- Executing the script generates a table in the schema S3_IMPORT_LOGGING. This table has three rows: filename, last_modified, and status.
The table is used to keep track of the import status. Only files that have been modified or added since the last time the script was executed will be imported. If you want to import files that have already been imported, set the parameter force reload to true.
CREATE CONNECTION S3_MY_BUCKETNAME TO 'http://<my_bucketname>.s3.<my_region>.amazonaws.com' -- my_region could e.g. be eu-west-1 USER '<my_access_key>'IDENTIFIED BY '<my_secret_key>';
execute script DATABASE_MIGRATION.s3_parallel_read( true -- if true, statements are executed immediately, if false only statements are generated , false -- force reload: if true, table and logging table will be truncated, all files in bucket will be loaded again , 'S3_IMPORT_LOGGING' -- schema you want to use for the logging tables ,'PRODUCT' -- name of the schema that holds the table you want to import into ,'test' -- name of the table you want to import into ,'S3_MY_BUCKETNAME' -- connection name ( see statement above) ,'my_project/' -- folder name, if you want to import everything, leave blank , '' -- filter for file-names, to include all files, put empty string, for using only csv files, put '*.csv' ,4 -- number of parallel connections you want to use ,'ENCODING=''ASCII'' SKIP=1 ROW SEPARATOR = ''CRLF''' -- file options, see manual, section 'import' for further information ) ;
When you perform the import for the first time, you need to execute all the scripts in the file. The next time onwards, you only need to execute the execute script DATABASE_MIGRATION.s3_parallel_read statement as the UDFs and the scripts are already created.