Load data from Parquet files in Amazon S3 on AWS
This article explains how to load data from Parquet files in Amazon S3 buckets on AWS using IMPORT.
Introduction
In Exasol 2025.1 and later you can use the IMPORT command to load data from Parquet files in Amazon S3 buckets on AWS using the native bulk loader in Exasol. Multiple Parquet files can be imported in parallel across nodes, enabling higher throughput for large datasets, and a single Parquet file can be read using multiple parallel operations, resulting in faster and more efficient reads.
This operation is only supported for files stored on Amazon S3 on AWS. Other cloud services that use an S3 compatible protocol are not supported. Exasol automatically recognizes an Amazon S3 import based on the URL that is used with the IMPORT statement.
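For example, a minimal import could look like the following sketch. The connection name, credentials, bucket, schema, table, and file paths are placeholders; see Create a connection and Examples for complete details.
CREATE OR REPLACE CONNECTION s3_parquet_conn
TO 'https://my_bucket.s3.eu-west-1.amazonaws.com'
USER 'my_access_key' IDENTIFIED BY 'my_secret_key';
IMPORT INTO MY_SCHEMA.MY_TABLE
FROM PARQUET AT s3_parquet_conn
FILE 'path/to/file_1.parquet'
FILE 'path/to/file_2.parquet';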
Amazon S3 URL format support
The following Amazon S3 URL formats are supported:
| URL format | Example |
|---|---|
| https://<bucketname>.s3.amazonaws.com | https://my_bucket.s3.amazonaws.com |
| https://<bucketname>.s3.<aws_region>.amazonaws.com | https://my_bucket.s3.eu-west-1.amazonaws.com |
| s3://<bucketname> | s3://my_bucket |
The AWS region is automatically detected if omitted in the URL.
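For example, the following two connection definitions point to the same bucket using different URL formats; the bucket name, region, and credentials are placeholders (sketch only):
CREATE OR REPLACE CONNECTION my_s3_bucket
TO 's3://my_bucket'
USER 'my_access_key' IDENTIFIED BY 'my_secret_key';
-- Equivalent definition with the region stated explicitly:
CREATE OR REPLACE CONNECTION my_s3_bucket_region
TO 'https://my_bucket.s3.eu-west-1.amazonaws.com'
USER 'my_access_key' IDENTIFIED BY 'my_secret_key';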
AWS authentication options
Authentication using an access key ID and secret access key is supported for seamless access to AWS resources. Session tokens are also supported, enabling secure authentication with short-lived credentials for temporary access to AWS.
For more information about AWS authentication methods, see AWS Identity and Access Management.
We recommend that you always create a connection object instead of providing the authentication details directly in the IMPORT command. To learn more, see Create a connection.
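For example, a connection that authenticates with temporary credentials could include a session token, as in the following sketch (all connection details and credential values are placeholders):
CREATE OR REPLACE CONNECTION my_s3_bucket_temp
TO 'https://my_bucket.s3.eu-west-1.amazonaws.com'
USER 'my_temporary_access_key' IDENTIFIED BY 'my_temporary_secret_key'
SESSION TOKEN 'my_session_token';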
Supported data types
The following Parquet data types are supported:
INT8, INT16, INT32, INT64, UINT8, UINT16, UINT32, UINT64, FLOAT, DOUBLE
Additional supported data types:
Decimal128, String, LargeString, Date32, Date64, Bool, Timestamp
Source data tracking
Additional SQL options enable row-origin tracking by storing the SHA256 hash of the source file and the source row number as metadata during the import.
The target table must contain suitable columns to store the metadata.
| SQL option | Description | Column type |
|---|---|---|
| SOURCE FILE HASH_SHA256 | Allows secure mapping of rows to their respective file origins. | HASHTYPE (256 BIT) |
| SOURCE ROW NUMBER | Captures the position of the row in the source file (0-based index). | DECIMAL(p,s) |
Column mapping
This feature is only available in Exasol 2025.1.5 and later.
You can specify source column names using the WITH SOURCE COLUMN NAMES=() option in the IMPORT statement. This allows you to explicitly map Parquet columns to columns in the target table.
Usage notes
- Column names are case sensitive. If there is a case mismatch between the source and target column names, the target column defaults to NULL.
- The columns in the source must be in the exact same order as the target columns.
- Duplicate column names in the Parquet file are not allowed. The query will abort with error ETL-2231.
- Columns in the source that are not mapped to a target column are ignored.
- If a source column is missing, the corresponding target column defaults to NULL.
SkipCols cannot be used together with the SOURCE COLUMN NAMES option. If both are specified in the same query, the import operation fails with the exception ETL-2230.
Type compatibility
Automatic casting is done for compatible Parquet data types. For example, INT32 → INT64. Unsupported types will result in a runtime exception.
| Target column type | Supported Parquet types | Unsupported types |
|---|---|---|
| Integer / Decimal | int8, int16, int32, int64, uint8, uint16, uint32, uint64, decimal128, boolean | float, double, string, timestamp, etc. |
| Double | int8, int16, int32, int64, uint8, uint16, uint32, uint64, float, double | decimal128, string, etc. |
| Varchar, Char, Hash | string, large_string | numeric, binary, bool, timestamp |
| Date | date32, date64 | string, numeric, timestamp, etc. |
| Boolean | boolean | numeric, string, date, timestamp |
| Timestamp | timestamp, date32, date64 | numeric, string, bool |
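For example, a hypothetical Parquet file containing int32, float, date32, and timestamp columns could be loaded into the following target table; the int32 and float values are cast automatically to the DECIMAL and DOUBLE columns. The schema, table, connection, and file names are placeholders (sketch only):
CREATE TABLE MY_SCHEMA.MEASUREMENTS (
id DECIMAL(18,0),    -- filled from an int32 Parquet column
reading DOUBLE,      -- filled from a float Parquet column
read_date DATE,      -- filled from a date32 Parquet column
read_time TIMESTAMP  -- filled from a timestamp Parquet column
);
IMPORT INTO MY_SCHEMA.MEASUREMENTS
FROM PARQUET AT my_s3_bucket
FILE 'measurements/part_0.parquet';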
Use cases
Scenario 1: Column mapping for diverse schemas
When working with datasets where source file schemas vary:
- Map target table columns to specific Parquet columns using the WITH SOURCE COLUMN NAMES=() option.
- Unmapped Parquet columns are ignored.
Scenario 2: Strict import schema enforcement
Ensure that all target table columns have corresponding mapped source columns for the import operation to succeed.
Best practices
- Use the WITH SOURCE COLUMN NAMES=() option when loading from schema-diverse Parquet datasets to enforce mapping consistency.
- Make sure that there is no case mismatch between the column names in the Parquet file and the target database table to avoid unexpected NULL values.
For an example of how to use column mapping, see Example 4 - Use explicit column mapping.
Column skipping
The SkipCols configuration parameter enables selective exclusion of columns from the imported Parquet data based on their index positions (0-based). This provides precise control over which columns to include or exclude while importing data. Skipped columns are defined as part of the connection string using ranges or discrete column indices.
SkipCols cannot be used together with the SOURCE COLUMN NAMES option. If both are specified in the same query, the import operation fails with the exception ETL-2230.
For more information and examples of use, see Configuration parameters and Examples.
Configuration parameters
Configuration parameters can be set per query using the connection string, or as global defaults using database-level parameters. If a parameter is set in the connection string, it overrides the corresponding database parameter.
Usage notes
- Connection string parameters are case-insensitive.
- Connection string parameters override database parameter values when both are provided.
- Default values for connection string parameters are sourced from the database parameters if not expressly set.
- If an exception is thrown for an invalid database parameter, the error message will be printed only on the first node.
Overview of configuration parameters
The following table is an overview of the available connection string parameters and the corresponding database parameters. For more details about each parameter, see Configuration parameter details.
| Connection string parameter | Database parameter | Description |
|---|---|---|
| MaxConnections | -etlParquetMaxConnections | Number of Parquet files imported in parallel on a single node. |
| MaxConcurrentReads | -etlParquetMaxConcurrentReads | Number of parallel read operations per file. |
| MaxBatchFetchSize | -etlParquetMaxBatchFetchSize | Maximum simultaneous buffers fetched during import. |
| MaxRows | -etlParquetMaxRows | Maximum number of rows to read in a batch. |
| MaxRetry | -etlParquetMaxRetry | Maximum number of attempts to connect. |
| DebugMode | -etlParquetDebugMode | Enable AWS structured logging. |
| - | -etlLogLevel | Additional database parameter that sets the logging level when DebugMode is used. For more information, see Log levels. |
| - | -traceLevel | Additional database parameter that sets the logging level when DebugMode is used. For more information, see Log levels. |
| SkipCols | N/A | Skip specific column indexes during import. Example: SkipCols=1,3..8,11 |
Configuration parameter details
| Database parameter | Default value | Validation | Behavior when invalid or missing |
|---|---|---|---|
| -etlParquetMaxConnections | 8 | Must be > 0 | Exception is thrown and database startup is aborted. |
| -etlParquetMaxConcurrentReads | 3 | Must be > 0 | Exception is thrown and database startup is aborted. |
| -etlParquetMaxBatchFetchSize | 1 | Must be > 0 | Exception is thrown and database startup is aborted. |
| -etlParquetMaxRows | 4096 | Must be > 0 | Defaults to 4096. |
| -etlParquetMaxRetry | 10 | Must be ≥ 0 | Defaults to 10. |
| -etlParquetDebugMode | 0 (OFF) | Must be ≥ 0 | Debugging remains disabled. |
| -etlLogLevel | N/A | Must be ≥ 2 | If both parameters are not set to ≥ 2, the AWS logs will not appear even if DebugMode is enabled. For more information, see Log levels. |
| -traceLevel | N/A | Must be ≥ 2 | If both parameters are not set to ≥ 2, the AWS logs will not appear even if DebugMode is enabled. For more information, see Log levels. |
Log levels
The log level determines the granularity and amount of information logged by the system. Select a log level to tailor the logging output based on the use case, such as error debugging, performance monitoring, or high-level event tracking.
To enable structured logging for AWS, both -etlLogLevel and -traceLevel must be set to at least 2.
| Value | Log level | Description |
|---|---|---|
| 0 | OFF (default) | Disables logging completely. No log messages are written. |
| 1 | FATAL | Logs only critical errors that cause the application or service to terminate. |
| 2 | ERROR | Logs errors that occur during execution but do not stop the application. |
| 3 | WARN | Logs events that are potentially harmful to the process but are recoverable. |
| 4 | INFO | Logs informational messages about the execution flow, such as successful initializations and key events. |
| 5 | DEBUG | Logs detailed debug information, typically for development or troubleshooting purposes. |
| ≥ 6 | TRACE | Logs highly detailed information, useful for deep debugging and issue resolution. |
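For example, AWS structured logging for a single import could be enabled through the connection string, assuming that a non-zero DebugMode value turns it on. The database parameters -etlLogLevel and -traceLevel must additionally be set to at least 2 by an administrator for the AWS log messages to appear. Connection details are placeholders (sketch only):
CREATE OR REPLACE CONNECTION my_s3_bucket_debug
TO 'https://my_bucket.s3.eu-west-1.amazonaws.com;DebugMode=1'
USER 'my_access_key' IDENTIFIED BY 'my_secret_key';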
Create a connection
We recommend that you always create a connection object using CREATE CONNECTION and authenticate on AWS using the saved connection instead of providing authentication details directly in the IMPORT command.
CREATE OR REPLACE CONNECTION <my_connection>
TO 'https://<bucketname>.s3[.<aws_region>].amazonaws.com'
USER '<access_key>' IDENTIFIED BY '<secret_key>'
[SESSION TOKEN '<session_token>']
;
| Placeholder | Description |
|---|---|
| <bucketname> | The name of the Amazon S3 bucket |
| <aws_region> | The AWS region, for example: eu-west-1 |
| <access_key> | AWS access key ID |
| <secret_key> | AWS secret key |
| <session_token> | AWS session token (optional) |
Examples
Example 1 - Create schema and table with file tracking
In this example we create a connection to the S3 bucket, create a new schema and table, then load data from two Parquet files, using SQL options to store source metadata for row-origin tracking.
CREATE OR REPLACE CONNECTION my_s3_bucket
TO 'https://my_s3_bucket.s3.eu-west-1.amazonaws.com'
USER 'my_user' IDENTIFIED BY 'my_secret_key'
;
DROP SCHEMA IF EXISTS MY_SCHEMA CASCADE;
CREATE SCHEMA MY_SCHEMA;
CREATE TABLE MY_SCHEMA.MY_TABLE (sourcefile HASHTYPE (256 BIT), c1 INT, rownumber INT);
IMPORT INTO MY_SCHEMA.MY_TABLE
FROM PARQUET AT my_s3_bucket
FILE 'nested/path/to/test_1.parquet' FILE 'nested/path/to/test_2.parquet'
WITH
SOURCE FILE HASH_SHA256 = sourcefile
SOURCE ROW NUMBER = rownumber
;
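After the import, the metadata columns can be queried to trace each row back to its source file and position, for example:
SELECT sourcefile, rownumber, c1
FROM MY_SCHEMA.MY_TABLE
ORDER BY sourcefile, rownumber;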
Example 2 - Using configuration parameters
In this example the connection string includes configuration parameters that will do the following:
- Limit the number of files to import concurrently to 2
- Increase the number of buffers to fetch during the import to 4
- Exclude columns 1, 3, 4, 5, 6, 7, 8, and 11 from the imported dataset
CREATE OR REPLACE CONNECTION my_s3_bucket
TO 'https://my_s3_bucket.s3.eu-west-1.amazonaws.com;MaxConnections=2;MaxBatchFetchSize=4;SkipCols=1,3..8,11'
USER 'my_user' IDENTIFIED BY 'my_secret_key'
;
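An import that uses this connection inherits those settings. For example, the following sketch assumes a hypothetical target table whose columns match the Parquet columns that are not skipped:
IMPORT INTO MY_SCHEMA.MY_FILTERED_TABLE
FROM PARQUET AT my_s3_bucket
FILE 'nested/path/to/test_1.parquet'
FILE 'nested/path/to/test_2.parquet';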
Example 3 - Set dynamic batch size in the connection string
CREATE OR REPLACE CONNECTION my_s3_bucket
TO 'https://my_s3_bucket.s3.eu-west-1.amazonaws.com/;MaxRows=2048'
USER 'my_user' IDENTIFIED BY 'my_secret_key'
;
Example 4 - Use explicit column mapping
CREATE OR REPLACE CONNECTION my_s3_bucket
TO 'https://my_s3_bucket.s3.eu-west-1.amazonaws.com';
CREATE SCHEMA IF NOT EXISTS MY_SCHEMA;
CREATE TABLE MY_SCHEMA.MY_TABLE (
c1 INT,
name VARCHAR(256),
mapped_column VARCHAR(256),
sourcefile HASHTYPE (256 BIT),
rownumber INT
);
IMPORT INTO MY_SCHEMA.MY_TABLE
FROM PARQUET AT my_s3_bucket
FILE 'file_1.parquet'
FILE 'file_2.parquet'
WITH
SOURCE COLUMN NAMES = ('col1', 'col2', 'renamed_col')
SOURCE FILE HASH_SHA256 = sourcefile
SOURCE ROW NUMBER = rownumber;