Load data from Parquet files in Amazon S3 on AWS
This article explains how to load data from Parquet files in Amazon S3 buckets on AWS using IMPORT.
In Exasol 2025.1 and later you can use the IMPORT command to load data from Parquet files in Amazon S3 buckets on AWS using the native bulk loader in Exasol. Multiple Parquet files can be imported in parallel across nodes, enabling higher throughput for large datasets. A single Parquet file can also be read by multiple parallel operations, which reduces read times.
This operation is only supported for files stored on Amazon S3 on AWS. Other cloud services that use an S3-compatible protocol are not supported. Exasol automatically recognizes an Amazon S3 import based on the URL that is used with the IMPORT statement.
This feature is only available in Exasol 2025.1 and later.
Overview
This section provides an overview of the key features and configuration options that are available when importing Parquet files from Amazon S3 buckets on AWS using the IMPORT command.
To learn how to set the configuration options when importing Parquet files, see Configuration parameters.
For examples of how to connect and import files using different options, see Examples.
Amazon S3 URL format support
The following Amazon S3 URL formats are supported:
| URL format | Example |
|---|---|
| https://<bucketname>.s3.amazonaws.com | https://my_bucket.s3.amazonaws.com |
| https://<bucketname>.s3.<aws_region>.amazonaws.com | https://my_bucket.s3.eu-west-1.amazonaws.com |
| s3://<bucketname> | s3://my_bucket |
The AWS region is automatically detected if omitted in the URL.
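For example, a connection can be defined with the short s3:// form, in which case the region is detected automatically. This is a minimal sketch, assuming the s3:// format is also accepted in the TO clause of a connection; the bucket name and credentials are placeholders:
CREATE OR REPLACE CONNECTION my_s3_bucket
TO 's3://my_bucket'
USER 'my_access_key' IDENTIFIED BY 'my_secret_key'
;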
AWS authentication options
Authentication using an AWS access key ID and secret access key is supported. Session tokens are also supported, which allows authentication with short-lived credentials for temporary access to AWS.
For more information about AWS authentication methods, see AWS Identity and Access Management.
We recommend that you always create a connection object instead of providing the authentication details directly in the IMPORT command. To learn more, see Create a connection.
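As a sketch of the difference, credentials can be supplied inline in the IMPORT statement with USER and IDENTIFIED BY, while a connection object stores them once and keeps them out of the SQL text. The table, bucket, and file names below are placeholders:
-- Credentials supplied inline in the statement (not recommended)
IMPORT INTO MY_SCHEMA.MY_TABLE
FROM PARQUET AT 'https://my_bucket.s3.eu-west-1.amazonaws.com'
USER 'my_access_key' IDENTIFIED BY 'my_secret_key'
FILE 'path/to/data.parquet';
-- Credentials referenced through a connection object (recommended)
IMPORT INTO MY_SCHEMA.MY_TABLE
FROM PARQUET AT my_s3_bucket
FILE 'path/to/data.parquet';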
Supported data types
The following Parquet data types are supported:
INT8, INT16, INT32, INT64, UINT8, UINT16, UINT32, UINT64, FLOAT, DOUBLE
Additional supported data types:
Decimal128, String, LargeString, Date32, Date64, Bool, Timestamp
Source data tracking (SQL syntax extension)
Additional SQL options enable row-origin tracking by storing the SHA256 hash of the source file and the source row number as metadata during the import. The target table must contain suitable columns to store this metadata.
| SQL option | Description | Column type |
|---|---|---|
| SOURCE FILE HASH_SHA256 | Stores the SHA256 hash of the source file, which allows secure mapping of rows to their respective file origins. | HASHTYPE (256 BIT) |
| SOURCE ROW NUMBER | Captures the position of the row in the source file (0-based index). | DECIMAL(p,s) |
For examples of use, see Examples.
Column skipping during import
The SkipCols configuration parameter enables selective exclusion of columns from the imported Parquet data based on their index positions (0-based). This provides precise control over which columns to include or exclude while importing data. Skipped columns are defined as part of the connection string using ranges or discrete column indices.
For more information and examples of use, see Configuration parameters and Examples.
Configuration parameters
The following configuration parameters can be set either in the connection string for query-specific configurations, or as database-level parameters for global defaults.
- Connection string parameters override database parameter values when both are provided.
- Default values for connection string parameters are sourced from database parameters if not expressly set.
- Connection string parameters are case-insensitive.
| Connection string parameter | Database parameter | Description | Default value |
|---|---|---|---|
| MaxConnections | -etlParquetMaxConnections | Number of Parquet files imported in parallel on a single node. | 8 |
| MaxConcurrentReads | -etlParquetMaxConcurrentReads | Number of parallel read operations per file. | 3 |
| MaxBatchFetchSize | -etlParquetMaxBatchFetchSize | Maximum number of buffers fetched simultaneously during import. | 1 |
| SkipCols | (none) | Skips specific column indexes (0-based) during import. Example: SkipCols=1,3..8,11 excludes columns 1, 3, 4, 5, 6, 7, 8, and 11. All other columns remain part of the imported dataset. | (none) |
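Connection string parameters are appended to the bucket URL in the TO clause, separated by semicolons, as shown in Example 2. A minimal sketch with placeholder values:
CREATE OR REPLACE CONNECTION my_s3_bucket
TO 'https://my_bucket.s3.eu-west-1.amazonaws.com;MaxConnections=4;MaxConcurrentReads=2'
USER 'my_access_key' IDENTIFIED BY 'my_secret_key'
;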
Create a connection
We recommend that you create a connection object using CREATE CONNECTION and authenticate on AWS using the saved connection instead of providing authentication details directly in the IMPORT command.
CREATE OR REPLACE CONNECTION <my_connection>
TO 'https://<bucketname>.s3[.<aws_region>].amazonaws.com'
USER '<access_key>' IDENTIFIED BY '<secret_key>'
[SESSION TOKEN '<session_token>']
;
| Placeholder | Description |
|---|---|
| <bucketname> | The name of the Amazon S3 bucket |
| <aws_region> | The AWS region, for example: eu-west-1 |
| <access_key> | AWS access key ID |
| <secret_key> | AWS secret key |
| <session_token> | AWS session token (optional) |
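For example, a connection created with temporary credentials might look like the following (all values are placeholders):
CREATE OR REPLACE CONNECTION my_s3_bucket
TO 'https://my_bucket.s3.eu-west-1.amazonaws.com'
USER 'my_access_key' IDENTIFIED BY 'my_secret_key'
SESSION TOKEN 'my_session_token'
;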
Examples
Example 1 - schema and table creation with file tracking
In this example we create a connection to the S3 bucket, create a new schema and table, then load data from two Parquet files, using SQL options to store source metadata for row-origin tracking.
CREATE OR REPLACE CONNECTION my_s3_bucket
TO 'https://my_s3_bucket.s3.eu-west-1.amazonaws.com'
USER 'my_user' IDENTIFIED BY 'my_secret_key'
;
DROP SCHEMA IF EXISTS MY_SCHEMA CASCADE;
CREATE SCHEMA MY_SCHEMA;
CREATE TABLE MY_SCHEMA.MY_TABLE (sourcefile HASHTYPE (256 BIT), c1 INT, rownumber INT);
IMPORT INTO MY_SCHEMA.MY_TABLE
FROM PARQUET AT my_s3_bucket
FILE 'nested/path/to/test_1.parquet' FILE 'nested/path/to/test_2.parquet'
WITH
SOURCE FILE HASH_SHA256 = sourcefile
SOURCE ROW NUMBER = rownumber
;
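After the import, the stored metadata can be used to trace rows back to their source files, for example by grouping on the hash column. This is a sketch that assumes the table and columns created above:
-- Count imported rows per source file hash
SELECT sourcefile, COUNT(*) AS row_count, MAX(rownumber) AS max_row_number
FROM MY_SCHEMA.MY_TABLE
GROUP BY sourcefile;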
Example 2 - using configuration parameters
In this example the connection string includes configuration parameters that will do the following:
- Limit the number of files to import concurrently to 2
- Increase the number of buffers to fetch during the import to 4
- Exclude columns 1, 3, 4, 5, 6, 7, 8, and 11 from the imported dataset
CREATE OR REPLACE CONNECTION my_s3_bucket
TO 'https://my_s3_bucket.s3.eu-west-1.amazonaws.com;MaxConnections=2;MaxBatchFetchSize=4;SkipCols=1,3..8,11'
USER 'my_user' IDENTIFIED BY 'my_secret_key'
;
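An IMPORT statement that uses this connection follows the same pattern as Example 1; the parameters defined in the connection string then apply to the import. The table and file names below are placeholders:
IMPORT INTO MY_SCHEMA.MY_OTHER_TABLE
FROM PARQUET AT my_s3_bucket
FILE 'nested/path/to/test_3.parquet' FILE 'nested/path/to/test_4.parquet'
;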