Importing Parquet files
Learn how to load data from Parquet files stored locally, on remote file servers, or in the cloud.
You can use the IMPORT command to load data from Parquet files that are stored on cloud storage services, on remote file servers using FTP/FTPS, SFTP, and HTTP/HTTPS, and on local systems. Exasol supports the Apache Iceberg open table format in Parquet files.
Multiple Parquet files can be imported in parallel across all cluster nodes, enabling higher throughput for larger datasets. Single Parquet files can be read using multiple parallel operations for faster and more efficient data read operations.
In Exasol 2026.1 and later you can use glob patterns to import multiple Parquet files dynamically and recursively from S3-compatible storage. To learn more, see Load files using pattern matching (S3).
This article describes how to import Parquet files from cloud storage. To learn how to import Parquet files stored on local or remote systems, see Load data from local or remote files.
All Parquet import features are fully supported in Exasol 2026.1.0 and later, with partial support in earlier versions. For more details about which features are supported in your Exasol version, see the Release notes.
Connect to storage
This article describes the available features and considerations when importing Parquet files. To learn how to connect to your storage service or local storage, see the following articles:
Supported data types
The following Parquet data types are supported:
INT8, INT16, INT32, INT64, UINT8, UINT16, UINT32, UINT64, FLOAT, DOUBLE
Additional supported data types:
Decimal128, String, LargeString, Date32, Date64, Bool, Timestamp
Source data tracking
Additional SQL options enable row-origin tracking by appending SHA256 file hash and source row number as metadata during the import.
The target table must contain suitable columns to store the metadata.
| SQL option | Description | Column type |
|---|---|---|
SOURCE FILE HASH_SHA256
|
Allows secure mapping of rows to their respective file origins. |
|
SOURCE ROW NUMBER
|
Captures the position of the row in the source file (0-based index). | DECIMAL(p,s)
|
Column mapping
This feature is available in Exasol 2025.1.5 and later.
You can specify source column names by using WITH SOURCE COLUMN NAMES=() in the IMPORT statement. This allows you to explicitly map Parquet columns to columns in the target table.
Usage notes
-
Column names are case sensitive. If there is a case mismatch between the source/target column names, the target column defaults to NULL.
-
The columns in the source must be in the exact same order as the target columns.
-
Duplicate column names in the Parquet file are not allowed. The query will abort with error
ETL-2231. -
Columns in the source that are not mapped to a target column are ignored.
-
If a source column is missing, the corresponding target column defaults to NULL.
SkipCols cannot be used together with the SOURCE COLUMN NAMES option. If both are specified in the same query, the import operation fails with the exception ETL-2230.
Type compatibility
Automatic casting is done for compatible Parquet data types. For example, INT32 → INT64. Unsupported types will result in a runtime exception.
|
Target column type |
Supported Parquet types |
Unsupported types |
|---|---|---|
|
Integer / Decimal |
int8, int16, int32, int64, uint8, uint16, uint32, uint64, decimal128, boolean |
float, double, string, timestamp, etc. |
|
Double |
int8, int16, int32, int64, uint8, uint16, uint32, uint64, float, double |
decimal128, string, etc. |
|
(Varchar, Char, Hash) |
string, large_string |
numeric, binary, bool, timestamp |
|
Date |
date32, date64 |
string, numeric, timestamp, etc. |
|
Boolean |
boolean |
numeric, string, date, timestamp |
|
Timestamp |
timestamp, date32, date64 |
numeric, string, bool |
Usage
Scenario 1: Column mapping for diverse schemas
When working with datasets where source file schemas vary:
-
Map target table columns to specific Parquet columns using the
WITH SOURCE COLUMN NAMES=()option. -
Unmapped Parquet columns are ignored.
Scenario 2: Strict import schema enforcement
-
Ensure that all target table columns have corresponding mapped source columns for the import operation to succeed.
Best practices
-
Use the
WITH SOURCE COLUMN NAMES=()option when loading from schema-diverse Parquet datasets to enforce mapping consistency. -
Make sure that there is no case mismatch between the column names in the Parquet file and the target database table to avoid unexpected NULL values.
For an example of how to use column mapping, see Example 3 - Use explicit column mapping.
Column skipping
The SkipCols configuration parameter enables selective exclusion of columns from the imported Parquet data based on their index positions (0-based). This provides precise control over which columns to include or exclude while importing data. Skipped columns are defined as part of the connection string using ranges or discrete column indices.
SkipCols cannot be used together with the SOURCE COLUMN NAMES option. If both are specified in the same query, the import operation fails with the exception ETL-2230.
For more information and examples of use, see Configuration parameters and Examples.
Load files using pattern matching (S3)
This feature is available in Exasol 2026.1 and later.
Pattern matching is only supported when importing Parquet files from S3 storage.
The FILE parameter in IMPORT FROM PARQUET supports glob patterns when accessing data from S3-compatible storage systems. This allows importing multiple files without specifying each file explicitly. Glob patterns are interpreted in the file path and enable flexible selection of files across directories and subdirectories.
Supported patterns
|
Pattern |
Description |
|---|---|
|
|
Matches zero or more characters |
|
|
Matches exactly one character |
|
|
Matches one character from the specified set |
|
|
Matches one character not in the specified set |
|
|
Matches directories recursively |
The characters *, ?, [, ] are interpreted as glob pattern operators if they are used in the file path. To match them literally, escape them using a backslash /
Examples
Import all Parquet files from the bucket root:
IMPORT INTO my_table
FROM PARQUET
AT 's3://bucket/'
USER '...'
IDENTIFIED BY '...'
FILE '*.parquet';
Import recursively from a directory and its subdirectories:
...
FILE 'data/**/*.parquet';
Import files from a directory (non-recursive):
...
FILE 'data/*';
Match single character:
...
FILE 'data/file?.parquet';
Match a character set:
...
FILE 'data/file[abc].parquet';
Exclude characters:
...
FILE 'data/file[!abc].parquet';
Usage notes
-
Glob patterns are supported only in the
FILEclause. -
The bucket name in the
ATclause must be specified explicitly and does not support glob patterns. -
Glob patterns are supported only for S3-compatible storage systems.
-
Recursive matching is only enabled when using the exact syntax
/**/. For example:-
data/**/*.parquet→ recursive -
data**/*.parquet→ not recursive
-
Error handling
-
If no files match the specified pattern, the query fails.
-
If any matched file is not a valid Parquet file or has an incompatible schema, the entire import fails.
-
If multiple patterns are specified, each pattern must match at least one file.
-
Glob patterns are not allowed in the connection definition (
ATclause). -
Using glob patterns with unsupported storage systems results in an error.
Permissions
The user must have permission to use the s3:ListBucket action in AWS. Without this permission, the import fails. For more information, refer to the Amazon AWS documentation.
Configuration parameters
Configuration parameters can be set to be query-specific using the connection string, and as global defaults using database parameters. If a parameter is set in the connection string, it will override the default database parameter.
Usage notes
-
Connection string parameters are case-insensitive.
-
Connection string parameters override database parameter values when both are provided.
-
Default values are sourced from the database parameters if not expressly set.
| Parameter | Description | Validation |
|---|---|---|
MaxConnections
|
Number of Parquet files imported in parallel on a single node. |
Must be > 0 |
MaxConcurrentReads
|
Number of parallel read operations per file. Set to
Set to |
Must be > 0 |
MaxBatchFetchSize
|
Maximum simultaneous buffers fetched during import. | Must be > 0 |
MaxRows
|
Maximum number of rows to read in a batch. | Must be > 0 |
MaxRetry
|
Maximum number of attempts to connect. | Must be ≥ 0 |
SkipCols
|
Skip specific column indexes during import. Example: |
Only integers (
A range must have exactly two integers, one before and one after the two dots (
The starting value must be less than or equal to the ending value. For example, |
| AWS S3 and S3-compatible storage only | ||
DebugMode
|
Enable AWS structured logging. | Must be ≥ 0 |
EndPoint
|
URL when connecting to S3-compatible storage service (not AWS) | - |
UseProxy
|
Boolean, specifies if the proxy specified in the
|
- |
Log levels (AWS)
The log level determines the granularity and amount of information logged by the system. Select a log level to tailor the logging output based on the use case, such as error debugging, performance monitoring, or high-level event tracking.
To enable structured logging for AWS, the database parameters -etlLogLevel and -traceLevel must both be set to at least 2.
| Value | Log level | Description |
|---|---|---|
|
0 |
OFF (default) |
Disables logging completely. No log messages are written. |
|
1 |
FATAL |
Logs only critical errors that cause the application or service to terminate. |
|
2 |
ERROR |
Logs errors that occur during execution but do not stop the application. |
|
3 |
WARN |
Logs events that are potentially harmful to the process but are recoverable. |
|
4 |
INFO |
Logs informational messages about the execution flow, such as successful initializations and key events. |
|
5 |
DEBUG |
Logs detailed debug information, typically for development or troubleshooting purposes. |
|
≥ 6 |
TRACE |
Logs highly detailed information, useful for deep debugging and issue resolution. |
Schema inference
This feature is supported in Exasol 2026.1 and later.
The schema inference feature allows you to import Parquet files without first creating a target table in the database. The loader will automatically create the table if it does not exist. The syntax is:
CREATE OR REPLACE TABLE <table_name> AS (IMPORT FROM PARQUET <standard parquet import statement options>)
If the target table already exists, an exception is thrown and the import job is aborted.
Column mapping is not supported with schema inference.
Data type mapping
The data type of the target table is derived from the Parquet data type using the following mapping:
| Parquet data type | Exasol data type |
|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Precision and scale
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Example:
CREATE OR REPLACE TABLE my_schema.my_table
AS (
IMPORT FROM PARQUET AT 'my_s3_bucket.s3.eu-west-1.amazonaws.com;MaxConnections=2;MaxBatchFetchSize=4;SkipCols=1,3..8,11''
USER 'my_user'
IDENTIFIED BY 'my_secret_key' FILE 'dup2.parquet'
)
;
Examples
Exaple 1: Create a connection object
-- Amazon S3 connection
CREATE OR REPLACE CONNECTION my_conn
TO 'https://my_s3_bucket.s3.eu-west-1.amazonaws.com'
USER 'my_access_key' IDENTIFIED BY 'my_secret_key'
;
-- S3-compatible storage
CREATE OR REPLACE CONNECTION my_conn
TO 's3://my_bucket;EndPoint=https://example.com:9000'
USER 'my_user' IDENTIFIED BY 'my_secret'
;
-- Azure Blob Storage
CREATE CONNECTION my_conn
TO 'DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net'
USER 'my_account_name' IDENTIFIED BY 'my_account_key'
;
Example 2 - Create schema and table with file tracking
In this example we connect using a stored connection (see examples above), create a new schema and table, then load data from two Parquet files, using SQL options to store source metadata for row-origin tracking.
CREATE SCHEMA IF NOT EXISTS MY_SCHEMA;
CREATE TABLE MY_SCHEMA.MY_TABLE (sourcefile HASHTYPE (256 BIT), c1 INT, rownumber INT);
IMPORT INTO MY_SCHEMA.MY_TABLE
FROM PARQUET AT my_conn
FILE 'nested/path/to/test_1.parquet' FILE 'nested/path/to/test_2.parquet'
WITH
SOURCE FILE HASH_SHA256 = sourcefile
SOURCE ROW NUMBER = rownumber
;
Example 3 - Use explicit column mapping
CREATE SCHEMA IF NOT EXISTS MY_SCHEMA;
CREATE TABLE MY_SCHEMA.MY_TABLE (
c1 INT,
name VARCHAR(256),
mapped_column VARCHAR(256),
sourcefile HASHTYPE (256 BIT),
rownumber INT
);
IMPORT INTO MY_SCHEMA.MY_TABLE
FROM PARQUET AT my_s3_bucket
FILE 'file_1.parquet'
FILE 'file_2.parquet'
WITH
SOURCE COLUMN NAMES = ('col1', 'col2', 'renamed_col')
SOURCE FILE HASH_SHA256 = sourcefile
SOURCE ROW NUMBER = rownumber
;
Example 4 - Use configuration parameters in connection string
The following examples show a connection to AWS. The procedure is identical for all storage types.
-- Limit the number of files to import concurrently to 2
-- Increase the number of buffers to fetch during the import to 4
-- Exclude columns 1, 3, 4, 5, 6, 7, 8, and 11 from the imported dataset
CREATE OR REPLACE CONNECTION my_conn
TO 'https://my_s3_bucket.s3.eu-west-1.amazonaws.com;MaxConnections=2;MaxBatchFetchSize=4;SkipCols=1,3..8,11'
USER 'my_access_key' IDENTIFIED BY 'my_secret_key'
;
-- Set a dynamic batch size
CREATE OR REPLACE CONNECTION my_conn
TO 'https://my_s3_bucket.s3.eu-west-1.amazonaws.com/;MaxRows=2048'
USER 'my_access_key' IDENTIFIED BY 'my_secret_key'
;