UDF overview
This article provides an overview of how user defined functions (UDFs) work.
Introduction
User defined functions (UDFs) in Exasol allow you to program your own analysis, processing, and generation functions and to execute these functions in parallel inside an Exasol cluster. By using UDFs you can solve problems that are not possible to solve using only SQL statements.
What is the difference between UDFs and scripts in Exasol?
Scripts are limited to using the Exasol scripting language (Lua), whereas UDFs support several programming languages and can be used to run complex functions. A script is also a sequential program that only runs on a single cluster node, while UDFs are executed in parallel on all cluster nodes. This means that scripts are not suitable for processing result tables of SQL queries, since iterative operations on big data sets will be very time-consuming.
UDF programming languages
UDF scripts can be written in Java, Lua, Python, and R. Each of these programming languages provides different functionalities (for example, statistical functions in R) and different libraries. To learn more about how you can create UDFs in each specific programming language, including examples that you can use as a starting point, see Details for Programming Languages.
Extensions
You can implement the following extensions in your UDF scripts:
Performance
Embed UDFs in SELECT statements
To ensure optimum performance and scalability we recommend that you always create your UDF using CREATE SCRIPT and then use the script within a SELECT statement. When the statement is processed by Exasol, multiple virtual machines are started
for each script and node. These virtual machines process the data independently.
For scalar functions, the input rows are distributed across the virtual machines to
achieve maximum parallelism. For SET input tuples, the virtual machines are used
per group if you specify a GROUP BY clause. Otherwise there will be only one group, which means only one node and virtual machine can process
the data.
The query cache is not used for SELECT statements in UDF scripts. See also Query Cache.
Choose the right programming language based on purpose and context
From a performance perspective, which programming language you should use in a UDF script depends on the purpose and context of the UDF, since specific elements may have different capacities in each language. For example, string processing can be faster in one language, while XML parsing can be faster in another. If overall performance is the most important criteria we recommend using Lua, as this language is integrated natively in Exasol and therefore has the smallest process overhead.
GPU support for UDFs
In on-premises installations of Exasol 2025.2 and later you can utilize GPUs to accelerate parallel processing for user defined functions (UDFs) in Exasol clusters. To learn more about how to use GPU support in UDFs, see GPU support for UDFs. To learn how to install and configure GPU support in a new Exasol deployment, see Install and configure GPU support.
A technology preview version of this feature is available in Exasol 2025.1. If you want to try out GPU acceleration in Exasol 2025.1, contact Support.
Input and output values
In the CREATE SCRIPT command, you must define the type of input and output values. For example, you can create a script that generates multiple result rows out of a single input row (SCALAR ... EMITS).
| Input values | |
SCALAR
|
Specifies that the script processes single input rows. The code is therefore called once per input row. |
SET
|
Specifies that the processing refers to a set of input values. Within the code, you can iterate through those values. For more information, see UDF Examples. |
| Output values | |
RETURNS
|
Specifies that the script returns a single value. |
EMITS
|
Specifies that the script can create (emit) multiple result rows (tuples). If the input type is |
You can define the data types of input and output parameters to specify the conversion between internal data types and the database SQL data types. If you do not specify the data types, the script has to handle that dynamically.
Each UDF script must contain the main function run(). This function is called with a parameter providing access to the
input data of Exasol. If your script processes multiple input tuples (using SET), you can iterate through the
single tuples using this parameter.
You can specify
an ORDER BY clause either when creating a script (CREATE SCRIPT) or when calling it. This clause sorts the processing of the groups of SET input
data. If it is necessary for the algorithm,
you should specify this clause when creating the script to avoid
wrong results due to misuse.
Input parameters in scripts are always case sensitive, similar to the script code. This is different to SQL identifiers, which are only case sensitive if they are delimited.
UDF script signature generator
Exasol provides a web application that allows you to generate a UDF script template based on the programming language, input type, script name, parameters, and output type. The resulting SQL contains the UDF signature, a language-specific run method, and an example on how to iterate over and access the input and produce output. This template can be used as a starting point when developing your own UDFs.