UDFs overview
This article provides an overview of how to use user defined functions (UDFs) in Exasol.
Introduction
User defined functions (UDFs) in Exasol allow you to program your own analysis, processing, and generation functions and to execute these functions in parallel inside an Exasol cluster. By using UDFs you can solve problems that are not possible to solve using only SQL statements.
What is the difference between UDFs and scripts in Exasol?
Scripts are limited to using the Exasol scripting language (Lua), whereas UDFs support several programming languages and can be used to run complex functions. A script is also a sequential program that only runs on a single cluster node, while UDFs are executed in parallel on all cluster nodes. This means that scripts are not suitable for processing result tables of SQL queries, since iterative operations on big data sets will be very time-consuming.
Which programming languages are supported?
UDF scripts can be written in Java, Lua, Python, and R. Each of these programming languages provides different functionalities (for example, statistical functions in R) and different libraries. To learn more about how you can create UDFs in each specific programming language, including examples that you can use as a starting point, see Details for programming languages.
Extensions
You can implement extensions in your UDFs to use scalar functions, aggregate and analytic functions, MapReduce programs, and user-defined ETL. For more information and examples, see UDF examples.
Environment variables
You can set environment variables for the script execution using the syntax %env <variable>=<value>. One use case for this would be to
set the path to libraries that are stored in BucketFS.
Example:
%env MY_LIBRARY_PATH=/buckets/bfsdefault/my_libs/my_lib
Performance
The following guidelines will help you ensure good performance when using UDFs.
Always embed UDFs in SELECT statements
To ensure optimum performance and scalability we recommend that you always create your UDF using CREATE SCRIPT and then use the script within a SELECT statement. When the SELECT statement is processed by Exasol, multiple virtual machines are started for each script and node. These virtual machines process the data independently in parallel.
For scalar functions, the input rows are distributed across the virtual machines to
achieve maximum parallelism. For SET input tuples, the virtual machines are used
per group if you specify a GROUP BY clause. Otherwise there will be only one group, which means only one node and virtual machine can process
the data.
The query cache is not used for SELECT statements in UDF scripts.
Choose the right programming language based on purpose and context
From a performance perspective, which programming language to use in a UDF script depends on both the purpose and the context of the UDF, since specific elements may have different capacities in each language. For example, string processing can be faster in one language, while XML parsing can be faster in another.
If overall performance is the most important criteria we recommend using Lua, since this language is integrated natively in Exasol and therefore has the smallest process overhead.
To learn more, see Details for programming languages.
UDF instance limiting
To avoid query out-of-memory situations, you can limit the number of UDF instances used for each UDF call per node. Instance limiting can be used in UDF scripts written in Java, Python, and R.
To learn more, see UDF instance limiting.
Input and output values
In the CREATE SCRIPT command, you must define the type of input and output values. For example, you can create a script that generates multiple result rows out of a single input row (SCALAR ... EMITS).
| Input values | |
SCALAR
|
Specifies that the script processes single input rows. The code is therefore called once per input row. |
SET
|
Specifies that the processing refers to a set of input values. Within the code, you can iterate through those values. For more information, see Examples. |
| Output values | |
RETURNS
|
Specifies that the script returns a single value. |
EMITS
|
Specifies that the script can create (emit) multiple result rows (tuples). If the input type is |
You can define the data types of input and output parameters to specify the conversion between internal data types and the database SQL data types. If you do not specify the data types, the script has to handle that dynamically.
Each UDF script must contain the main function run(). This function is called with a parameter providing access to the
input data of Exasol. If your script processes multiple input tuples (using SET), you can iterate through the
single tuples using this parameter.
You can specify
an ORDER BY clause either when creating a script (CREATE SCRIPT) or when calling it. This clause sorts the processing of the groups of SET input
data. If it is necessary for the algorithm,
you should specify this clause when creating the script to avoid
wrong results due to misuse.
Input parameters in scripts are always case sensitive, similar to the script code. This is different from SQL identifiers, which are only case sensitive if they are delimited.
GPU support for UDFs
In on-premises installations of Exasol 2025.2 and later you can utilize GPUs to accelerate parallel processing for user defined functions (UDFs) in Exasol clusters. To learn more about how to use GPU support in UDFs, see GPU support for UDFs. To learn how to install and configure GPU support in a new Exasol deployment, see Install and configure GPU support.
A technology preview version of this feature is available in Exasol 2025.1. For more information, contact Support.
UDF script signature generator
Exasol provides a web application that allows you to generate a UDF template based on the programming language, input type, script name, parameters, and output type. The resulting SQL contains the UDF signature, a language-specific run method, and an example on how to iterate over and access the input and produce output. This template can be used as a starting point when developing your own UDFs.