UDF Scripts
This section provides an overview of how to use UDF scripts in Exasol.
Overview
UDF scripts allow you to program your own analysis, processing, and generation functions, and to execute these functions in parallel inside an Exasol cluster. By using UDF scripts you can solve problems that are not possible to solve with SQL statements.
With UDF scripts, you can implement the following extensions:
Programming languages
Exasol supports the programming languages Java, Lua, R, and Python in UDF scripts. These languages provide different functionalities (for example, statistical functions in R) and different libraries.
For more details about each programming language, see Details for Programming Languages.
The actual versions of the scripting languages can be listed with the corresponding metadata function.
Performance
For best performance, we recommend that you create a script using CREATE SCRIPT and then use this script within a SELECT statement. Embedding the script within SQL in this way provides the best performance and scalability.
From a performance perspective, which programming language you should use in an UDF script depends on the purpose and context of the script, since specific elements may have different capacities in each language. For example, string processing can be faster in one language while XML parsing can be faster in another. This means that one language cannot be said to have better performance in all circumstances. However, if overall performance is the most important criteria, we recommend using Lua. Lua is integrated in Exasol in the most native way, and therefore has the smallest process overhead.
During the processing of a SELECT
statement, multiple virtual machines are started
for each script and node. These virtual machines process the data independently.
For scalar functions, the input rows are distributed across those virtual machines to
achieve maximum parallelism. For SET
input tuples, the virtual machines are used
per group if you specify a GROUP BY
clause. Otherwise there will be only one group, which means only one node and virtual machine can process
the data.
Input and output values
In the CREATE SCRIPT command, you must define the type of input and output values. For example, you can create a script that generates multiple result rows out of a single input row (SCALAR ... EMITS
).
Input values | |
SCALAR |
Specifies that the script processes single input rows. The code is therefore called once per input row. |
SET
|
Specifies that the processing refers to a set of input values. Within the code, you can iterate through those values. For more information, see UDF Examples. |
Output values | |
RETURNS
|
Specifies that the script returns a single value. |
EMITS
|
Specifies that the script can create (emit) multiple result rows (tuples). If the input type is |
You can define the data types of input and output parameters to specify the conversion between internal data types and the database SQL data types. If you do not specify the data types, the script has to handle that dynamically.
Each UDF script must contain the main function run()
. This function is called with a parameter providing access to the
input data of Exasol. If your script processes multiple input tuples (using SET
), you can iterate through the
single tuples using this parameter.
You can specify
an ORDER BY
clause either when creating a script (CREATE SCRIPT
) or when calling it. This clause sorts the processing of the groups of SET
input
data. If it is necessary for the algorithm,
you should specify this clause when creating the script to avoid
wrong results due to misuse.
Input parameters in scripts are always case sensitive, similar to the script code. This is different to SQL identifiers, which are only case sensitive if they are delimited.
UDF Script Signature Generator
You can use the UDF Script Signature Generator to generate a simple dummy template for a UDF script by configuring details such as the language, input type, script name, parameters, and output type. The resulting SQL can be used as a template and as a starting point for developing UDFs. It contains the UDF signature, a language-specific run method, and an example on how to iterate over and access the input, and how to produce output.