UDF Scripts

This section describes how to use UDF scripts in Exasol.

Overview

UDF scripts allow you to program your own analysis, processing, and generation functions, and to execute these functions in parallel inside an Exasol cluster. By using UDF scripts you can solve problems that are not possible to solve with SQL statements.

With UDF scripts, you can implement the following extensions:

Programming languages

Exasol supports the programming languages Java, Lua, R, and Python in UDF scripts. These languages provide different functionalities (for example, statistical functions in R) and different libraries.

For more details about each programming language, see Details for Programming Languages.

The actual versions of the scripting languages can be listed with the corresponding metadata function.

Performance

For best performance, we recommend that you create a script using CREATE SCRIPT and then use this script within a SELECT statement. Embedding the script within SQL in this way provides the best performance and scalability.

From a performance perspective, which programming language you should use in an UDF script depends on the purpose and context of the script, since specific elements may have different capacities in each language. For example, string processing can be faster in one language while XML parsing can be faster in another. This means that one language cannot be said to have better performance in all circumstances. However, if overall performance is the most important criteria, we recommend using Lua. Lua is integrated in Exasol in the most native way, and therefore has the smallest process overhead.

During the processing of a SELECT statement, multiple virtual machines are started for each script and node. These virtual machines process the data independently. For scalar functions, the input rows are distributed across those virtual machines to achieve maximum parallelism. For SET input tuples, the virtual machines are used per group if you specify a GROUP BY clause. Otherwise there will be only one group, which means only one node and virtual machine can process the data.

Input and output values

In the CREATE SCRIPT command, you must define the type of input and output values. For example, you can create a script that generates multiple result rows out of a single input row (SCALAR ... EMITS).

Input values
SCALAR

Specifies that the script processes single input rows. The code is therefore called once per input row.

SET Specifies that the processing refers to a set of input values. Within the code, you can iterate through those values. For more information, see UDF Examples.
Output values
RETURNS Specifies that the script returns a single value.
EMITS

Specifies that the script can create (emit) multiple result rows (tuples).

If the input type is SET, the EMITS result can only be used alone, not combined with other expressions. However, you can nest it through a subselect to do further processing of those intermediate results.

You can define the data types of input and output parameters to specify the conversion between internal data types and the database SQL data types. If you do not specify the data types, the script has to handle that dynamically.

Each UDF script must contain the main function run(). This function is called with a parameter providing access to the input data of Exasol. If your script processes multiple input tuples (using SET), you can iterate through the single tuples using this parameter.

You can specify an ORDER BY clause either when creating a script (CREATE SCRIPT) or when calling it. This clause sorts the processing of the groups of SET input data. If it is necessary for the algorithm, you should specify this clause when creating the script to avoid wrong results due to misuse.

Input parameters in scripts are always case sensitive, similar to the script code. This is different to SQL identifiers, which are only case sensitive if they are delimited.

UDF Script Signature Generator

You can use the UDF Script Signature Generator to generate a simple dummy template for a UDF script by configuring details such as the language, input type, script name, parameters, and output type. The resulting SQL can be used as a template and as a starting point for developing UDFs. It contains the UDF signature, a language-specific run method, and an example on how to iterate over and access the input, and how to produce output.