If the standard interfaces such as files, JDBC sources, or native Oracle and Exasol connections are not sufficient to cover your complete ETL scope, you can develop user-defined ETL processes with the help of UDF scripts. This section describes how flexibly UDFs can integrate almost any system and data format with Exasol (for example, Hadoop).
The concept is based on creating a row-emitting script (with return type EMIT) using the CREATE SCRIPT statement. If you embed such a script in an INSERT INTO <table> (SELECT my_emit_script('param') [FROM ...]) command, the UDF script is called on the cluster during the execution of the SQL statement, and the emitted rows are inserted into the target table.
You can implement any processing in the script, such as establishing connections to external systems, extracting the data, converting data types, and even more complex transformations. Through parameters, you can hand over all required information to the script.
The advantage of this architecture is that you can integrate different libraries from different scripting languages to perform specific functions. For example, if you use a scripting language's libraries to handle data formats, you can support new data formats from Hadoop systems within Exasol without waiting for the next software release.
You can also leverage the Dynamic Input and Output Parameters to develop generic scripts that you can reuse for all types of tables with different structures and data types.
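As a standalone illustration of such a generic script, the sketch below emits whatever input columns it receives, regardless of their number and types. The Row class is a hypothetical stand-in for Exasol's real script context object, used here only so the logic can run outside the database:

```python
# Standalone sketch of a "generic" emit function that works for any
# column structure, mimicking a UDF with dynamic input/output parameters.
# The Row class is a hypothetical stand-in for Exasol's ctx object.

class Row:
    def __init__(self, values):
        self.values = values          # input columns, in declaration order
        self.emitted = []             # rows "emitted" by the script

    def emit(self, *cols):
        self.emitted.append(cols)

def passthrough_upper(ctx):
    """Emit every input column, upper-casing any string values.

    Because it iterates over whatever columns are present, the same
    logic works for tables with different structures and data types."""
    out = tuple(v.upper() if isinstance(v, str) else v for v in ctx.values)
    ctx.emit(*out)

row = Row([1, "alice", 9.5])
passthrough_upper(row)
print(row.emitted)  # [(1, 'ALICE', 9.5)]
```

The same function handles a two-column table, a ten-column table, or mixed types without modification, which is the point of declaring dynamic parameters in the script signature.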
Using UDFs to Load Data in Parallel
To achieve the optimum loading performance, you can parallelize ETL processes using UDF scripts by specifying the GROUP BY clause. The UDF script is called in parallel for every single group, exactly as it is in the normal usage of UDF scripts within plain SELECT statements.
The following example shows how to load data in parallel with the help of UDF scripts. The inner SELECT invokes a UDF script once; it connects to a service and returns the list of existing JSON files, together with a partition number for each file. The outer SELECT calls another UDF script (with input type SET) that loads the data in parallel, one call per partition ID.
INSERT INTO my_table (
    SELECT load_my_json(service_address, filepath) EMITS (revenues DECIMAL)
    FROM (SELECT get_json_files('my-service-address', 10))
    GROUP BY partition
);
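The partition number returned by the inner script is what drives the parallelism: every distinct value forms one group, and the loading UDF is invoked once per group. The standalone sketch below shows one way such a partitioning step could look; the file list and the round-robin scheme are illustrative assumptions, not part of the Exasol API:

```python
# Sketch: assign each file a partition ID so that GROUP BY partition
# yields roughly equal-sized groups. num_partitions corresponds to the
# second argument of the hypothetical get_json_files call.

def assign_partitions(files, num_partitions):
    # Round-robin assignment; a hash of the path would work as well.
    return [(f, i % num_partitions) for i, f in enumerate(files)]

files = ["a.json", "b.json", "c.json", "d.json"]
for path, partition in assign_partitions(files, num_partitions=2):
    print(path, partition)
# a.json 0
# b.json 1
# c.json 0
# d.json 1
```

With two partitions, the outer SET script would be called twice in parallel, each call receiving the files of one group.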
IMPORT FROM SCRIPT Syntax
As a precondition, the UDF script must implement a specific method that creates a corresponding SELECT statement string out of the IMPORT information. In a normal scenario, the same UDF script is then called in the generated SELECT to perform the actual loading. For more information about the method and other scripting languages, see Details for Programming Languages.
The internal execution consists of two phases, the SQL text generation and the subsequent embedding into an INSERT INTO SELECT statement. Exasol handles the second phase by adding the target table name and columns.
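Conceptually, the second phase is simple string assembly. The helper below is a plain-Python illustration of that embedding step, not Exasol's actual internals:

```python
# Illustration of the two phases of IMPORT FROM SCRIPT:
#  1. the UDF script returns a SELECT statement string,
#  2. Exasol embeds it into INSERT INTO <target> (<cols>) SELECT ...
# This helper only mimics phase 2 for demonstration purposes.

def embed_into_insert(generated_select, target_table, target_columns):
    cols = ", ".join(target_columns)
    return f"INSERT INTO {target_table} ({cols}) {generated_select}"

sql = embed_into_insert("SELECT ETL.GENERATE_SEQ(5)", "my_table", ["val"])
print(sql)  # INSERT INTO my_table (val) SELECT ETL.GENERATE_SEQ(5)
```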
The following example shows the concept of the syntax. The user-defined IMPORT generates a sequence of integer values; the length of the sequence is defined through a parameter.
--/
CREATE PYTHON SCALAR SCRIPT etl.generate_seq (num INT) EMITS (val INT) AS
def generate_sql_for_import_spec(import_spec):
    if not "RECORDS" in import_spec.parameters:
        raise ValueError('Mandatory parameter RECORDS is missing')
    return "SELECT ETL.GENERATE_SEQ("+import_spec.parameters["RECORDS"]+")"

def run(ctx):
    for i in range(1, ctx["num"]+1):
        ctx.emit(i)
/

IMPORT INTO (val int) FROM SCRIPT etl.generate_seq WITH records='5';

VAL
-------
1
2
3
4
5
The UDF script generate_seq implements the method generate_sql_for_import_spec(import_spec), which creates a simple SELECT statement and forwards the parameter RECORDS. The same UDF script also implements a run() method that contains the actual logic of the data generation. In general, the SQL text created by generate_sql_for_import_spec(import_spec) can also call a different UDF script than the one specified in the IMPORT command.
The call of UDF scripts within the SELECT has to be schema-qualified.
The example also shows how the IMPORT parameter is handed over to the script call within the SELECT. The import_spec object provides information to the script, for example, the parameters of the WITH clause, the used connection, or the column names of the target table of the IMPORT statement.
You can use this information to dynamically control the appropriate SQL text generation and the interaction with the external system. The specific name of the metadata object and its content may vary across the scripting languages. For details, see Details for Programming Languages.
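As a standalone sketch of that metadata-driven generation, the function below mirrors the generate_seq example but uses a plain dictionary in place of the real import_spec object, whose exact shape is language-specific as noted above:

```python
# Sketch of generate_sql_for_import_spec logic with a plain dict
# standing in for the real import_spec object. The mandatory RECORDS
# parameter is validated before any SQL text is produced.

def generate_sql_for_import_spec(import_spec):
    params = import_spec["parameters"]
    if "RECORDS" not in params:
        raise ValueError("Mandatory parameter RECORDS is missing")
    return "SELECT ETL.GENERATE_SEQ(" + params["RECORDS"] + ")"

spec = {"parameters": {"RECORDS": "5"}}
print(generate_sql_for_import_spec(spec))
# SELECT ETL.GENERATE_SEQ(5)

# A missing mandatory parameter is rejected before any SQL is generated:
try:
    generate_sql_for_import_spec({"parameters": {}})
except ValueError as e:
    print(e)  # Mandatory parameter RECORDS is missing
```

Validating parameters in the generation phase means a misconfigured IMPORT fails immediately with a clear message, rather than mid-load.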
The advantages of the IMPORT FROM SCRIPT statement are as follows:
- IMPORT remains the central command for loading data into Exasol.
- The user interface is simple and intuitive. You only define the parameters required for your IMPORT, which hides the underlying SQL statements; these can become complex in real-world scenarios.
- It supports named parameters, which makes it easier to work with mandatory and optional parameters and improves the readability.
- Like IMPORT, the IMPORT FROM SCRIPT command can be used as a subselect (and hence in views), which allows you to embed the data access.
Similar to user-defined IMPORT processes, you can implement user-defined EXPORT processes. You create a UDF script that establishes a connection to the external system within the script code and then transfers the data. A simple SELECT my_export_script(...) FROM <table> can send all table rows to an external system. To minimize the overhead of establishing connections, it is recommended to use a SET script and control the parallelism through a GROUP BY clause.
SELECT my_export_script('<service_address>', filepath) FROM my_table GROUP BY partition;
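The connection-saving effect of the SET script with GROUP BY can be sketched in plain Python; the row format and the simulated connection counter below are illustrative assumptions:

```python
# Sketch: a SET script with GROUP BY receives all rows of a group in
# one call, so one (simulated) connection per group is opened instead
# of one per row.

from collections import defaultdict

def export_in_groups(rows):
    """rows: list of (partition, payload) pairs. Returns the number of
    connections opened, which is one per partition group."""
    groups = defaultdict(list)
    for partition, payload in rows:
        groups[partition].append(payload)
    connections = 0
    for partition, payloads in groups.items():
        connections += 1      # connect once per group
        # ... send all payloads of this group over the connection ...
    return connections

rows = [(0, "r1"), (1, "r2"), (0, "r3"), (1, "r4"), (1, "r5")]
print(export_in_groups(rows))  # 2
```

Five rows result in only two connections here; with a scalar script and no grouping, each row could incur its own connection setup.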
EXPORT INTO SCRIPT Syntax
As a precondition, the UDF script implements a specific method that creates a corresponding SQL string using the export information. For more information about the method and other scripting languages, see Details for Programming Languages.
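By analogy with the import case, the SQL-generation method can be sketched in plain Python. The method name mirrors the import counterpart; the dictionary shape, the SERVICE parameter, and the EXPORT_ROWS script name are illustrative assumptions, not the documented Exasol API:

```python
# Sketch: build the SELECT that performs a user-defined EXPORT. The
# export_spec dict stands in for the real metadata object; SERVICE is
# a hypothetical parameter naming the external target system.

def generate_sql_for_export_spec(export_spec):
    params = export_spec["parameters"]
    if "SERVICE" not in params:
        raise ValueError("Mandatory parameter SERVICE is missing")
    source = export_spec["source_table"]
    return (f"SELECT ETL.EXPORT_ROWS('{params['SERVICE']}') "
            f"FROM {source} GROUP BY partition")

spec = {"parameters": {"SERVICE": "my-service"}, "source_table": "my_table"}
print(generate_sql_for_export_spec(spec))
# SELECT ETL.EXPORT_ROWS('my-service') FROM my_table GROUP BY partition
```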