Load data from Parquet files in Amazon S3 on AWS

This article explains how to load data from Parquet files in Amazon S3 buckets on AWS using IMPORT.

In Exasol 2025.1 and later, you can use the IMPORT command to load data from Parquet files in Amazon S3 buckets on AWS using Exasol's native bulk loader. Multiple Parquet files can be imported in parallel across nodes, enabling higher throughput for larger datasets. A single Parquet file can also be read using multiple parallel operations, which reduces read times.

This operation is only supported for files stored in Amazon S3 on AWS. Other cloud services that use an S3-compatible protocol are not supported. Exasol automatically recognizes an Amazon S3 import based on the URL that is used with the IMPORT statement.

This feature is only available in Exasol 2025.1 and later.
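
For example, assuming a connection object named my_s3_bucket has already been created (see Create a connection), a minimal import of a single Parquet file could look like the following sketch; the target table and file path are placeholders:

IMPORT INTO MY_SCHEMA.MY_TABLE
    FROM PARQUET AT my_s3_bucket
    FILE 'path/to/data.parquet'
;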

Overview

This section provides an overview of the key features and configuration options that are available when importing Parquet files from Amazon S3 buckets on AWS using the IMPORT command.

To learn how to set the configuration options when importing Parquet files, see Configuration parameters.

For examples of how to connect and import files using different options, see Examples.

Amazon S3 URL format support

The following Amazon S3 URL formats are supported:

URL format                                            Example
https://<bucketname>.s3.amazonaws.com                 https://my_bucket.s3.amazonaws.com
https://<bucketname>.s3.<aws_region>.amazonaws.com    https://my_bucket.s3.eu-west-1.amazonaws.com
s3://<bucketname>                                     s3://my_bucket

The AWS region is automatically detected if it is omitted from the URL.
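
For example, the short s3:// form could be used directly when creating a connection object (see Create a connection); the bucket name and credentials below are placeholders, and the region is detected automatically:

CREATE OR REPLACE CONNECTION my_s3_bucket
    TO 's3://my_bucket'
    USER 'my_access_key' IDENTIFIED BY 'my_secret_key'
;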

AWS authentication options

Authentication using an access key ID and secret access key is supported for access to AWS resources. Session tokens are also supported, enabling secure authentication with short-lived temporary credentials for temporary access to AWS.

For more information about AWS authentication methods, see AWS Identity and Access Management.

We recommend that you always create a connection object instead of providing the authentication details directly in the IMPORT command. To learn more, see Create a connection.

Supported data types

The following Parquet data types are supported:

INT8, INT16, INT32, INT64, UINT8, UINT16, UINT32, UINT64, FLOAT, DOUBLE

Additional supported data types:

Decimal128, String, LargeString, Date32, Date64, Bool, Timestamp
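
The following sketch shows a target table that could receive several of these Parquet types. The table name and the column type mapping (for example, INT64 to DECIMAL and String to VARCHAR) are illustrative assumptions, not a definitive conversion table:

-- Hypothetical target table; the Exasol column type chosen for each
-- Parquet source type is an illustrative assumption.
CREATE TABLE MY_SCHEMA.SENSOR_DATA (
    id          DECIMAL(19,0),  -- Parquet INT64
    reading     DOUBLE,         -- Parquet DOUBLE
    label       VARCHAR(2000),  -- Parquet String
    recorded_on DATE,           -- Parquet Date32
    recorded_at TIMESTAMP,      -- Parquet Timestamp
    is_valid    BOOLEAN         -- Parquet Bool
);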

Source data tracking (SQL syntax extension)

Additional SQL options enable row-origin tracking by storing the SHA256 hash of the source file and the source row number as metadata during the import. The target table must contain suitable columns to store this metadata.

SOURCE FILE HASH_SHA256
    Description: Allows secure mapping of rows to their respective file origins.
    Column type: CHAR(64), VARCHAR(n) with n >= 64, or HASHTYPE (256 BIT)

SOURCE ROW NUMBER
    Description: Captures the position of the row in the source file (0-based index).
    Column type: DECIMAL(p,s)

For examples of use, see Examples.

Column skipping during import

The SkipCols configuration parameter excludes specific columns from the imported Parquet data based on their index positions (0-based), giving you precise control over which columns are included in the import. Skipped columns are defined as part of the connection string, using ranges or discrete column indices.

For more information and examples of use, see Configuration parameters and Examples.

Configuration parameters

The following configuration parameters can be set either in the connection string, for query-specific configuration, or as database-level parameters, for global defaults.

  • Connection string parameters override database parameter values when both are provided.

  • Default values for connection string parameters are sourced from database parameters if not expressly set.

  • Connection string parameters are case-insensitive.

MaxConnections
    Database parameter: -etlParquetMaxConnections
    Description: Number of Parquet files imported in parallel on a single node.
    Default value: 8

MaxConcurrentReads
    Database parameter: -etlParquetMaxConcurrentReads
    Description: Number of parallel read operations per file. Set to 0 or 1 for sequential reads.
    Default value: 3

MaxBatchFetchSize
    Database parameter: -etlParquetMaxBatchFetchSize
    Description: Maximum simultaneous buffers fetched during import.
    Default value: 1

SkipCols
    Database parameter: (none)
    Description: Skips specific column indexes during import. Example: SkipCols=1,3..8,11 excludes columns 1, 3, 4, 5, 6, 7, 8, and 11. All other columns remain part of the imported dataset.
    Default value: (none)
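
Connection string parameters are appended to the bucket URL and separated by semicolons, as shown in Example 2. The following sketch assumes a bucket in eu-west-1 and configures sequential reads per file while importing up to four files in parallel on each node:

CREATE OR REPLACE CONNECTION my_s3_bucket
    TO 'https://my_bucket.s3.eu-west-1.amazonaws.com;MaxConnections=4;MaxConcurrentReads=1'
    USER 'my_access_key' IDENTIFIED BY 'my_secret_key'
;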

Create a connection

We recommend that you create a connection object using CREATE CONNECTION and authenticate on AWS using the saved connection instead of providing authentication details directly in the IMPORT command.

CREATE OR REPLACE CONNECTION <my_connection>
    TO 'https://<bucketname>.s3[.<aws_region>].amazonaws.com'
    USER '<access_key>' IDENTIFIED BY '<secret_key>'
    [SESSION TOKEN '<session_token>']
;
Placeholder       Description
<bucketname>      The name of the Amazon S3 bucket
<aws_region>      The AWS region, for example: eu-west-1
<access_key>      AWS access key ID
<secret_key>      AWS secret access key
<session_token>   AWS session token (optional)
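
For example, when authenticating with temporary credentials, the optional session token can be included in the connection; the values below are placeholders:

CREATE OR REPLACE CONNECTION my_s3_bucket
    TO 'https://my_bucket.s3.eu-west-1.amazonaws.com'
    USER 'my_access_key' IDENTIFIED BY 'my_secret_key'
    SESSION TOKEN 'my_session_token'
;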

Examples

Example 1 - schema and table creation with file tracking

In this example we create a connection to the S3 bucket, create a new schema and table, then load data from two Parquet files, using SQL options to store source metadata for row-origin tracking.

CREATE OR REPLACE CONNECTION my_s3_bucket
    TO 'https://my_s3_bucket.s3.eu-west-1.amazonaws.com'
    USER 'my_user' IDENTIFIED BY 'my_secret_key'
;

DROP SCHEMA IF EXISTS MY_SCHEMA CASCADE;
CREATE SCHEMA MY_SCHEMA;
CREATE TABLE MY_SCHEMA.MY_TABLE (sourcefile HASHTYPE (256 BIT), c1 INT, rownumber INT);

IMPORT INTO MY_SCHEMA.MY_TABLE
    FROM PARQUET AT my_s3_bucket 
    FILE 'nested/path/to/test_1.parquet' FILE 'nested/path/to/test_2.parquet' 
    WITH 
      SOURCE FILE HASH_SHA256 = sourcefile
      SOURCE ROW NUMBER = rownumber 
;
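
After the import, the stored metadata can be used to check how many rows were loaded from each source file. The following query is a sketch that assumes the table and column names from the example above:

SELECT sourcefile, COUNT(*) AS row_count, MAX(rownumber) AS last_row
    FROM MY_SCHEMA.MY_TABLE
    GROUP BY sourcefile
;
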
Example 2 - using configuration parameters

In this example the connection string includes configuration parameters that will do the following:

  • Limit the number of Parquet files imported in parallel on each node to 2 (MaxConnections=2)
  • Increase the maximum number of buffers fetched simultaneously during the import to 4 (MaxBatchFetchSize=4)
  • Exclude columns 1, 3, 4, 5, 6, 7, 8, and 11 from the imported dataset (SkipCols=1,3..8,11)
CREATE OR REPLACE CONNECTION my_s3_bucket
    TO 'https://my_s3_bucket.s3.eu-west-1.amazonaws.com;MaxConnections=2;MaxBatchFetchSize=4;SkipCols=1,3..8,11'
    USER 'my_user' IDENTIFIED BY 'my_secret_key'
;
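
A connection defined this way is then used with IMPORT in the same way as in Example 1. The target table and file path in the following sketch are placeholders:

IMPORT INTO MY_SCHEMA.MY_OTHER_TABLE
    FROM PARQUET AT my_s3_bucket
    FILE 'nested/path/to/test_3.parquet'
;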