UDF Scripts

This section explains how to use UDF scripts in Exasol.

UDF scripts provide you the ability to program your own analysis, processing, or generation functions, and to execute them in parallel inside Exasol's high performance cluster (In Database Analytics). By using UDF scripts you can solve problems that are not possible to solve with SQL statements. Through UDF scripts, you get a highly flexible interface for implementing nearly every requirement, and will become an HPC developer without the need for previous knowledge.

With UDF scripts, you can implement the following extensions:

To get the best performance of the UDF scripts, you need to create a script (see CREATE SCRIPT) and use this script within a SELECT statement. This embedding within SQL gives the best performance and scalability. Exasol supports multiple programming languages (Java, Lua, R, and Python). These languages provide advantages according to their own areas (for example, statistical functions in R) and different libraries (XML parser and so on). Details about these programming languages are available in Details for Programming Languages.

The actual versions of the scripting languages can be listed with the corresponding metadata function.

In the CREATE SCRIPT command, you need to define the type of input and output values. For example, you can create a script which generates multiple result rows out of a single input row (SCALAR ... EMITS).

Input Values
SCALAR

The keyword SCALAR specifies that the script processes single input rows. It's code is therefore called once per input row.

SET If you define the option SET, then the processing refers to a set of input values. Within the code, you can iterate through those values (for more information, see UDF Examples).
Output Values
RETURNS In this case the script returns a single value.
EMITS

If the keyword EMITS was defined, the script can create (emit) multiple result rows (tuples). In case of input type SET, the EMITS result can only be used alone, thus not be combined with other expressions. However, you can nest it through a subselect to do further processing of those intermediate results.

The data types of input and output parameters can be defined to specify the conversion between internal data types and the database SQL data types. If you don't specify them, the script has to handle that dynamically (see details and examples below).

The input parameters of scripts are always treated case-sensitive, similar to the script code. This is different to SQL identifiers which are only treated case-sensitive when being delimited.

Scripts must contain the main function run(). This function is called with a parameter providing access to the input data of Exasol. If your script processes multiple input tuples (thus a SET script), you can iterate through the single tuples by the use of this parameter.

UDF Script Signature Generator

You can use the UDF Script Signature Generator to generate a simple dummy template for a UDF script by configuring details such as the language, input type, the script name, parameters, and output type.

The resulting SQL can be used as a template and as a starting point for developing UDFs. It contains the UDF signature, a language-specific run method, and an example on how to iterate over and access the input, and how to produce output.

Additional Information for UDF Scripts

Topic

Details

Internal processing

During the processing of a SELECT statement, multiple virtual machines are started for each script and node. These virtual machines process the data independently.

For scalar functions, the input rows are distributed across those virtual machines to achieve maximum parallelism.

For SET input tuples, the virtual machines are used per group if you specify a GROUP BY clause. Otherwise there will be only one group, which means only one node and virtual machine can process the data.

ORDER BY clause

You can specify an ORDER BY clause while creating a script (CREATE SCRIPT) or while calling it. This sorts the processing of the groups of SET input data.

If it is necessary for the algorithm, you should specify this clause while creating the script to avoid wrong results due to misuse.

Performance comparison between the programming languages

The performance of the different languages cannot be compared, since the specific elements of the languages can have different capacities. For example, a string processing can be faster in one language and the XML parsing can be faster in the other one.

However, Exasol recommends use of Lua if performance is the most important criteria. Lua is integrated in Exasol in the most native way and therefore has the smallest process overhead.