Load data from Parquet files in Amazon S3 on AWS

This article explains how to load data from Parquet files in Amazon S3 buckets on AWS using IMPORT.

Introduction

In Exasol 2025.1 and later, you can use the IMPORT command to load data from Parquet files in Amazon S3 buckets on AWS using the native bulk loader in Exasol. Multiple Parquet files can be imported in parallel across nodes, enabling higher throughput for larger datasets. A single Parquet file can also be read by multiple parallel operations, which makes reads faster and more efficient.

This operation is only supported for files stored on Amazon S3 on AWS. Other cloud services that use an S3-compatible protocol are not supported. Exasol automatically recognizes an Amazon S3 import based on the URL used in the IMPORT statement.
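
For orientation, the following is a minimal sketch of such an import. It assumes a connection object my_s3_bucket has already been created (see Create a connection) and that the target table and the listed Parquet files exist; listing several files in one statement allows Exasol to distribute them across nodes.

IMPORT INTO MY_SCHEMA.MY_TABLE
    FROM PARQUET AT my_s3_bucket
    FILE 'data/part_1.parquet'
    FILE 'data/part_2.parquet'
;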

Amazon S3 URL format support

The following Amazon S3 URL formats are supported:

| URL format | Example |
| --- | --- |
| https://<bucketname>.s3.amazonaws.com | https://my_bucket.s3.amazonaws.com |
| https://<bucketname>.s3.<aws_region>.amazonaws.com | https://my_bucket.s3.eu-west-1.amazonaws.com |
| s3://<bucketname> | s3://my_bucket |

The AWS region is automatically detected if omitted in the URL.
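
For example, a connection that uses the short s3:// form could look like the following sketch (the bucket name and credentials are placeholders); because the URL contains no region, the region is detected automatically.

CREATE OR REPLACE CONNECTION my_s3_bucket
    TO 's3://my_bucket'
    USER 'my_access_key' IDENTIFIED BY 'my_secret_key'
;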

AWS authentication options

Authentication using an access key ID and secret access key is supported. Session tokens are also supported, allowing secure authentication with short-lived credentials for temporary access to AWS.

For more information about AWS authentication methods, see AWS Identity and Access Management.

We recommend that you always create a connection object instead of providing the authentication details directly in the IMPORT command. To learn more, see Create a connection.
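
For example, a connection object that authenticates with temporary credentials could look like the following sketch, based on the template in Create a connection (all values are placeholders):

CREATE OR REPLACE CONNECTION my_s3_bucket
    TO 'https://my_bucket.s3.eu-west-1.amazonaws.com'
    USER 'my_access_key' IDENTIFIED BY 'my_secret_key'
    SESSION TOKEN 'my_session_token'
;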

Supported data types

The following Parquet data types are supported:

INT8, INT16, INT32, INT64, UINT8, UINT16, UINT32, UINT64, FLOAT, DOUBLE

Additional supported data types:

Decimal128, String, LargeString, Date32, Date64, Bool, Timestamp

Source data tracking

Additional SQL options enable row-origin tracking by storing the SHA-256 hash of the source file and the source row number as metadata during the import.

The target table must contain suitable columns to store the metadata.

| SQL option | Description | Column type |
| --- | --- | --- |
| SOURCE FILE HASH_SHA256 | Allows secure mapping of rows to their respective file origins. | CHAR(64), VARCHAR(n) with n ≥ 64, or HASHTYPE (256 BIT) |
| SOURCE ROW NUMBER | Captures the position of the row in the source file (0-based index). | DECIMAL(p,s) |
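
Once loaded, this metadata can be queried like any other column, for example to trace rows back to their source file. The following sketch uses the column names from Example 1 - Create schema and table with file tracking.

SELECT sourcefile, rownumber, c1
FROM MY_SCHEMA.MY_TABLE
ORDER BY sourcefile, rownumber;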

Column mapping

This feature is only available in Exasol 2025.1.5 and later.

You can specify source column names using the WITH SOURCE COLUMN NAMES=() option in the IMPORT statement. This allows you to explicitly map Parquet columns to columns in the target table.

Usage notes

  • Column names are case-sensitive. If there is a case mismatch between the source and target column names, the target column defaults to NULL.

  • The columns in the source must be in the exact same order as the target columns.

  • Duplicate column names in the Parquet file are not allowed. The query will abort with error ETL-2231.

  • Columns in the source that are not mapped to a target column are ignored.

  • If a source column is missing, the corresponding target column defaults to NULL.

SkipCols cannot be used together with the SOURCE COLUMN NAMES option. If both are specified in the same query, the import operation fails with the exception ETL-2230.

Type compatibility

Automatic casting is performed for compatible Parquet data types, for example INT32 → INT64. Unsupported types result in a runtime exception.

| Target column type | Supported Parquet types | Unsupported types |
| --- | --- | --- |
| Integer / Decimal | int8, int16, int32, int64, uint8, uint16, uint32, uint64, decimal128, boolean | float, double, string, timestamp, etc. |
| Double | int8, int16, int32, int64, uint8, uint16, uint32, uint64, float, double | decimal128, string, etc. |
| Varchar, Char, Hash | string, large_string | numeric, binary, bool, timestamp |
| Date | date32, date64 | string, numeric, timestamp, etc. |
| Boolean | boolean | numeric, string, date, timestamp |
| Timestamp | timestamp, date32, date64 | numeric, string, bool |
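
As an illustration, a target table for Parquet files with the columns (id: int64, name: string, created: date32, updated: timestamp) could be declared as in the following sketch; the table and column names and the VARCHAR size are assumptions for this example.

CREATE TABLE MY_SCHEMA.EVENTS (
    id       DECIMAL(18,0),  -- int64 is cast to an integer/decimal column
    name     VARCHAR(2000),  -- string can be loaded into VARCHAR, CHAR, or HASHTYPE columns
    created  DATE,           -- date32 and date64 map to DATE
    updated  TIMESTAMP       -- timestamp maps to TIMESTAMP
);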

Use cases

Scenario 1: Column mapping for diverse schemas

When working with datasets where source file schemas vary:

  • Map target table columns to specific Parquet columns using the WITH SOURCE COLUMN NAMES=() option.

  • Unmapped Parquet columns are ignored.

Scenario 2: Strict import schema enforcement

Ensure that all target table columns have corresponding mapped source columns for the import operation to succeed.

Best practices

  • Use the WITH SOURCE COLUMN NAMES=() option when loading from schema-diverse Parquet datasets to enforce mapping consistency.

  • Make sure that there is no case mismatch between the column names in the Parquet file and the target database table to avoid unexpected NULL values.

For an example of how to use column mapping, see Example 4 - Use explicit column mapping.

Column skipping

The SkipCols configuration parameter enables selective exclusion of columns from the imported Parquet data based on their index positions (0-based), giving you precise control over which columns are included in the import. The columns to skip are defined as part of the connection string using discrete column indexes, ranges, or a combination of both.
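
For example, the following sketch excludes columns 1 and 3 of the source files from the import (bucket name and credentials are placeholders); all other columns remain part of the imported dataset.

CREATE OR REPLACE CONNECTION my_s3_bucket
    TO 'https://my_s3_bucket.s3.eu-west-1.amazonaws.com;SkipCols=1,3'
    USER 'my_user' IDENTIFIED BY 'my_secret_key'
;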

SkipCols cannot be used together with the SOURCE COLUMN NAMES option. If both are specified in the same query, the import operation fails with the exception ETL-2230.

For more information and examples of use, see Configuration parameters and Examples.

Configuration parameters

Configuration parameters can be set per query using the connection string, or as global defaults using database parameters. If a parameter is set in the connection string, it overrides the corresponding database parameter.

Usage notes

  • Connection string parameters are case-insensitive.

  • Connection string parameters override database parameter values when both are provided.

  • Default values for connection string parameters are sourced from the database parameters if they are not explicitly set.

  • If an exception is thrown for an invalid database parameter, the error message will be printed only on the first node.
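
To illustrate the override behavior: if -etlParquetMaxConnections is left at its default of 8, the following sketch limits imports through this connection to 4 Parquet files in parallel per node, because the connection string value takes precedence (bucket name and credentials are placeholders).

CREATE OR REPLACE CONNECTION my_s3_bucket
    TO 'https://my_s3_bucket.s3.eu-west-1.amazonaws.com;MaxConnections=4'
    USER 'my_user' IDENTIFIED BY 'my_secret_key'
;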

Overview of configuration parameters

The following table is an overview of the available connection string parameters and the corresponding database parameters. For more details about each parameter, see Configuration parameter details.

| Connection string parameter | Database parameter | Description |
| --- | --- | --- |
| MaxConnections | -etlParquetMaxConnections | Number of Parquet files imported in parallel on a single node. |
| MaxConcurrentReads | -etlParquetMaxConcurrentReads | Number of parallel read operations per file. Set to 1 for sequential reads. 0 is not allowed. |
| MaxBatchFetchSize | -etlParquetMaxBatchFetchSize | Maximum number of buffers fetched simultaneously during import. |
| MaxRows | -etlParquetMaxRows | Maximum number of rows to read in a batch. |
| MaxRetry | -etlParquetMaxRetry | Maximum number of attempts to connect. |
| DebugMode | -etlParquetDebugMode | Enable AWS structured logging. |
| - | -etlLogLevel | Additional database parameter to set logging levels when DebugMode is used. For more information, see Log levels. |
| - | -traceLevel | Additional database parameter to set logging levels when DebugMode is used. For more information, see Log levels. |
| SkipCols | N/A | Skip specific column indexes during import. Example: SkipCols=1,3..8,11 excludes columns 1, 3, 4, 5, 6, 7, 8, and 11. All other columns remain part of the imported dataset. |

Configuration parameter details

| Connection string parameter | Validation | Behavior when invalid or missing |
| --- | --- | --- |
| MaxConnections | Must be > 0 | Exception is thrown. |
| MaxConcurrentReads | Must be > 0 | Exception is thrown. |
| MaxBatchFetchSize | Must be > 0 | Exception is thrown. |
| MaxRows | Must be > 0 | Exception is thrown. |
| MaxRetry | Must be ≥ 0 | Exception is thrown. |
| DebugMode | Must be ≥ 0 | Exception is thrown. |
| SkipCols | Only integers (0 to 9), commas (,) to separate values, and ranges with two dots (..) are allowed. | Invalid characters: Exception is thrown and the query is terminated. |
| SkipCols | A range must have exactly two integers, one before and one after the two dots (..). For example, 3..5 is valid, but 3.. or ..5 are invalid. | Incorrect format: Exception is thrown. |
| SkipCols | The starting value must be less than or equal to the ending value. For example, 3..5 is valid, but 5..3 is invalid. | Invalid range: Exception is thrown and the query is terminated. |
| Database parameter | Default value | Validation | Behavior when invalid or missing |
| --- | --- | --- | --- |
| -etlParquetMaxConnections | 8 | Must be > 0 | Exception is thrown and database startup is aborted. |
| -etlParquetMaxConcurrentReads | 3 | Must be > 0 | Exception is thrown and database startup is aborted. |
| -etlParquetMaxBatchFetchSize | 1 | Must be > 0 | Exception is thrown and database startup is aborted. |
| -etlParquetMaxRows | 4096 | Must be > 0 | Defaults to 4096. |
| -etlParquetMaxRetry | 10 | Must be ≥ 0 | Defaults to 10. |
| -etlParquetDebugMode | 0 (OFF) | Must be ≥ 0 | Debugging remains disabled. |
| -etlLogLevel | N/A | Must be ≥ 2 | If both -etlLogLevel and -traceLevel are not set to ≥ 2, the AWS logs will not appear even if -etlParquetDebugMode is enabled. For more information, see Log levels. |
| -traceLevel | N/A | Must be ≥ 2 | See -etlLogLevel. |

Log levels

The log level determines the granularity and amount of information logged by the system. Select a log level to tailor the logging output based on the use case, such as error debugging, performance monitoring, or high-level event tracking.

To enable structured logging for AWS, both -etlLogLevel and -traceLevel must be set to at least 2.

| Value | Log level | Description |
| --- | --- | --- |
| 0 | OFF (default) | Disables logging completely. No log messages are written. |
| 1 | FATAL | Logs only critical errors that cause the application or service to terminate. |
| 2 | ERROR | Logs errors that occur during execution but do not stop the application. |
| 3 | WARN | Logs events that are potentially harmful to the process but are recoverable. |
| 4 | INFO | Logs informational messages about the execution flow, such as successful initializations and key events. |
| 5 | DEBUG | Logs detailed debug information, typically for development or troubleshooting purposes. |
| ≥ 6 | TRACE | Logs highly detailed information, useful for deep debugging and issue resolution. |

Create a connection

We recommend that you always create a connection object using CREATE CONNECTION and authenticate on AWS using the saved connection instead of providing authentication details directly in the IMPORT command.

CREATE OR REPLACE CONNECTION <my_connection>
    TO 'https://<bucketname>.s3[.<aws_region>].amazonaws.com'
    USER '<access_key>' IDENTIFIED BY '<secret_key>'
    [SESSION TOKEN '<session_token>']
;
| Placeholder | Description |
| --- | --- |
| <bucketname> | The name of the Amazon S3 bucket |
| <aws_region> | The AWS region, for example: eu-west-1 |
| <access_key> | AWS access key ID |
| <secret_key> | AWS secret access key |
| <session_token> | AWS session token (optional) |

Examples

Example 1 - Create schema and table with file tracking

In this example we create a connection to the S3 bucket, create a new schema and table, then load data from two Parquet files, using SQL options to store source metadata for row-origin tracking.

CREATE OR REPLACE CONNECTION my_s3_bucket
    TO 'https://my_s3_bucket.s3.eu-west-1.amazonaws.com'
    USER 'my_user' IDENTIFIED BY 'my_secret_key'
;

DROP SCHEMA IF EXISTS MY_SCHEMA CASCADE;
CREATE SCHEMA MY_SCHEMA;
CREATE TABLE MY_SCHEMA.MY_TABLE (sourcefile HASHTYPE (256 BIT), c1 INT, rownumber INT);

IMPORT INTO MY_SCHEMA.MY_TABLE
    FROM PARQUET AT my_s3_bucket 
    FILE 'nested/path/to/test_1.parquet' FILE 'nested/path/to/test_2.parquet' 
    WITH 
      SOURCE FILE HASH_SHA256 = sourcefile
      SOURCE ROW NUMBER = rownumber 
;
Example 2 - Using configuration parameters

In this example the connection string includes configuration parameters that will do the following:

  • Limit the number of files to import concurrently to 2
  • Increase the number of buffers to fetch during the import to 4
  • Exclude columns 1, 3, 4, 5, 6, 7, 8, and 11 from the imported dataset
CREATE OR REPLACE CONNECTION my_s3_bucket
    TO 'https://my_s3_bucket.s3.eu-west-1.amazonaws.com;MaxConnections=2;MaxBatchFetchSize=4;SkipCols=1,3..8,11'
    USER 'my_user' IDENTIFIED BY 'my_secret_key'
;
Example 3 - Set dynamic batch size in the connection string
CREATE OR REPLACE CONNECTION my_s3_bucket 
    TO 'https://my_s3_bucket.s3.eu-west-1.amazonaws.com/;MaxRows=2048' 
    USER 'my_user' IDENTIFIED BY 'my_secret_key'
;
Example 4 - Use explicit column mapping
CREATE OR REPLACE CONNECTION my_s3_bucket
  TO 'https://my_s3_bucket.s3.eu-west-1.amazonaws.com';

CREATE SCHEMA IF NOT EXISTS MY_SCHEMA;
CREATE TABLE MY_SCHEMA.MY_TABLE (
  c1 INT,
  name VARCHAR(256),
  mapped_column VARCHAR(256),
  sourcefile HASHTYPE (256 BIT),
  rownumber INT
);

IMPORT INTO MY_SCHEMA.MY_TABLE
  FROM PARQUET AT my_s3_bucket
  FILE 'file_1.parquet'
  FILE 'file_2.parquet'
  WITH
    SOURCE COLUMN NAMES = ('col1', 'col2', 'renamed_col')
    SOURCE FILE HASH_SHA256 = sourcefile
    SOURCE ROW NUMBER = rownumber;