Hadoop and Other Systems
Exasol provides Hadoop ETL UDFs, an open source project officially supported by Exasol, that you can use to transfer data between Exasol and a Hadoop cluster.
The SQL syntax used by the UDFs is similar to Exasol's native IMPORT and EXPORT commands. However, the syntax includes additional parameters to set Hadoop-specific properties.
The project currently supports the following:
- HCatalog Metadata (for example, table location, columns, partitions)
- Multiple file formats (for example, Parquet, ORC, or RCFile)
- Hadoop Distributed File System (HDFS) High Availability
- Partitions
- Parallelization
Prerequisites
The following are required to use the project for data transfer:
- A deployed Hadoop cluster with HDFS, Hive, and HCatalog services.
- Exasol data nodes can access the Hive Metastore.
- The Exasol cluster has connectivity to the NameNode and DataNodes through the native HDFS interface or the WebHDFS interface.
For more information, see Deployment Document on GitHub.
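As a quick connectivity check for the last prerequisite, you can query the NameNode's WebHDFS REST interface from an Exasol node. The hostname below is a placeholder, and the port depends on your Hadoop version (9870 is the default on Hadoop 3.x, 50070 on Hadoop 2.x):

```shell
# Verify that the NameNode's WebHDFS interface is reachable.
# Replace namenode-host and the port with your cluster's values.
curl -s "http://namenode-host:9870/webhdfs/v1/?op=LISTSTATUS"
```

A successful response is a JSON listing of the HDFS root directory; a connection error indicates a network or firewall issue between the clusters.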
Supported Hadoop Distributions
The currently supported Hadoop distributions are as follows:
- Apache Hadoop
- Cloudera Hadoop Distribution (CDH)
- Hortonworks Hadoop Distribution (HDP)
- AWS EMR
- GCP DataProc
Hadoop ETL UDF Script Deployment
To deploy the scripts, do the following:
- Build the jar for a specific distribution.
- Upload the jar file to a bucket in BucketFS. For more information, see BucketFS.
- Set up the ETL scripts by following the steps in Deploying the Hadoop ETL UDFs.
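The first two steps above can be sketched as shell commands. This is a minimal sketch, assuming a Maven build; the jar file name, host, bucket path, and write password are placeholders to adapt to your environment (BucketFS accepts uploads via HTTP PUT, by default on port 2580):

```shell
# 1. Build the jar for your target distribution (check the project's
#    build documentation for distribution-specific build options).
mvn clean package -DskipTests

# 2. Upload the resulting jar to a BucketFS bucket via HTTP PUT.
#    Replace exasol-host, the bucket path, and write_password as needed.
curl -X PUT -T target/hadoop-etl.jar \
  "http://w:write_password@exasol-host:2580/bucketfs1/bucket1/hadoop-etl.jar"
```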
Examples
Import
IMPORT INTO my_exasol_table FROM SCRIPT etl.import_hcat_table WITH
  HCAT_DB      = 'default'
  HCAT_TABLE   = 'my_hcat_table_1'
  HCAT_ADDRESS = 'hcatalog-server:50111'
  HDFS_USER    = 'hdfs';
Export
EXPORT my_exasol_table INTO SCRIPT etl.export_hcat_table WITH
  HCAT_DB      = 'default'
  HCAT_TABLE   = 'my_hcat_table_2'
  HCAT_ADDRESS = 'hcatalog-server:50111'
  HDFS_USER    = 'hdfs';
Contribute to the Project
Exasol encourages contributions to the open source project. To learn how to contribute, see Contributing.
Additional Reference
The Hadoop ETL UDFs project is not the only way to transfer data between Hadoop clusters and Exasol. Exasol also provides Virtual Schemas connectivity to access data through the Apache Hive and Apache Impala distributed query engines.