Hadoop and Other Systems

Exasol provides the Hadoop ETL UDFs project, which you can use to transfer data between Exasol and a Hadoop cluster. This is an open source project officially supported by Exasol.

The SQL syntax used in the UDFs is similar to Exasol's native IMPORT and EXPORT commands. However, it contains additional parameters to set the Hadoop properties.

The project currently supports the following:

  • HCatalog Metadata (for example, table location, columns, partitions)
  • Multiple file formats (for example, Parquet, ORC, or RCFile)
  • Hadoop Distributed File System (HDFS) High Availability
  • Partitions
  • Parallelization

Prerequisites

The following are required to use the project and transfer data:

  • A deployed Hadoop cluster with HDFS, Hive, and HCatalog services.
  • Exasol data nodes that can access the Hive Metastore.
  • An Exasol cluster with connectivity to the NameNode and DataNodes through the native HDFS interface or the WebHDFS interface.

For more information, see the deployment documentation on GitHub.

Supported Hadoop Distributions

The currently supported Hadoop distributions are as follows:

  • Apache Hadoop
  • Cloudera Hadoop Distribution (CDH)
  • Hortonworks Hadoop Distribution (HDP)
  • AWS EMR
  • GCP Dataproc

Hadoop ETL UDF Script Deployment

To deploy the scripts, do the following:

  1. Build the jar for a specific distribution.
  2. Upload the jar file to a bucket in BucketFS. For more information, see BucketFS.
  3. Set up the ETL scripts by following the steps in Deploying the Hadoop ETL UDFs. A sketch of what this step produces follows the list.
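
After the jar is uploaded, step 3 creates the ETL scripts in an Exasol schema. The following is a minimal sketch of one such definition, not the authoritative deployment SQL: the schema name etl, the bucket path, and the jar file name are placeholders that must match your build, and the exact script definitions and class names are listed in Deploying the Hadoop ETL UDFs.

CREATE SCHEMA IF NOT EXISTS etl;

-- The dynamic parameter lists (...) let the script accept the WITH
-- parameters shown in the examples below.
CREATE OR REPLACE JAVA SET SCRIPT etl.import_hcat_table(...) EMITS (...) AS
%scriptclass com.exasol.hadoop.scriptclasses.ImportHCatTable;
%jar /buckets/bfsdefault/hadoop-etl/hadoop-etl-dist-1.0.0.jar;
/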

Examples

Import
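
The following statement imports the Hive table my_hcat_table_1 from the HCatalog database default into the Exasol table my_exasol_table, connecting to the HCatalog server at hcatalog-server:50111 as the HDFS user hdfs: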

IMPORT INTO my_exasol_table FROM SCRIPT etl.import_hcat_table WITH
   HCAT_DB      = 'default' 
   HCAT_TABLE   = 'my_hcat_table_1' 
   HCAT_ADDRESS = 'hcatalog-server:50111' 
   HDFS_USER    = 'hdfs';
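
The feature list above includes partitions and parallelization, which are controlled through optional parameters. The following is a sketch of a partition-filtered import, assuming the PARTITIONS and PARALLELISM parameter names and value syntax from the project README; verify both against the README before use.

IMPORT INTO my_exasol_table FROM SCRIPT etl.import_hcat_table WITH
   HCAT_DB      = 'default'
   HCAT_TABLE   = 'my_hcat_table_1'
   HCAT_ADDRESS = 'hcatalog-server:50111'
   HDFS_USER    = 'hdfs'
   PARTITIONS   = 'part1=2015-01-01'  -- assumed parameter: import only matching partitions
   PARALLELISM  = 'nproc()';          -- assumed parameter: number of parallel UDF instances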

Export
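
The following statement exports the Exasol table my_exasol_table into the Hive table my_hcat_table_2 in the HCatalog database default: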

EXPORT my_exasol_table INTO SCRIPT etl.export_hcat_table WITH 
   HCAT_DB      = 'default' 
   HCAT_TABLE   = 'my_hcat_table_2' 
   HCAT_ADDRESS = 'hcatalog-server:50111' 
   HDFS_USER    = 'hdfs';

Contribute to the Project

Exasol encourages contributions to this open source project. To learn how to contribute, see Contributing.

Additional Reference

The Hadoop ETL UDFs project isn't the only way to transfer data between Hadoop clusters and Exasol. Exasol also provides Virtual Schemas connectivity to access data using the Apache Hive and Apache Impala distributed query engines.
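
For example, with a virtual schema adapter script installed and a connection object pointing to the Hive server, querying Hive tables from Exasol looks roughly like the following sketch. The adapter script adapter.jdbc_adapter, the connection details, and the property values are placeholders; the exact setup is described in the Virtual Schemas documentation.

-- Connection object holding the Hive JDBC URL and credentials (placeholder values)
CREATE CONNECTION hive_conn
   TO 'jdbc:hive2://hive-server:10000/default'
   USER 'hive' IDENTIFIED BY 'hivepassword';

-- Virtual schema that exposes the Hive tables for querying in Exasol
CREATE VIRTUAL SCHEMA hive_vs
   USING adapter.jdbc_adapter
   WITH
   SQL_DIALECT     = 'HIVE'
   CONNECTION_NAME = 'HIVE_CONN'
   SCHEMA_NAME     = 'default';

-- Hive tables can then be queried like regular Exasol tables
SELECT * FROM hive_vs.my_hcat_table_1;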