UDF Overview

This article explains how to get started with using user defined functions (UDFs) in Exasol.

Introduction

User defined functions (UDFs) in Exasol allow you to program your own analysis, processing, and generation functions and to execute these functions in parallel inside an Exasol cluster. By using UDFs you can solve problems that are not possible to solve using only SQL statements.

What is the difference between UDFs and scripts in Exasol?

Scripts are limited to using the Exasol scripting language (Lua), whereas UDFs support several programming languages and can be used to run complex functions. A script is also a sequential program that only runs on a single cluster node, while UDFs are executed in parallel on all cluster nodes. This means that scripts are not suitable for processing result tables of SQL queries, since iterative operations on big data sets will be very time-consuming.

UDF programming languages

UDF scripts can be written in Java, Lua, Python, and R. Each of these programming languages provides different functionalities (for example, statistical functions in R) and different libraries. To learn more about how you can create UDFs in each specific programming language, including examples that you can use as a starting point, see Details for Programming Languages.

Extensions

You can implement the following extensions in your UDF scripts:

To learn more, see the examples in the respective linked articles.

Performance

For best performance, we recommend that you always create your UDF script using CREATE SCRIPT and then use the script within a SELECT statement. Embedding the script within SQL in this way provides the best performance and scalability.

When a SELECT statement is processed by Exasol, multiple virtual machines are started for each script and node. These virtual machines process the data independently. For scalar functions, the input rows are distributed across those virtual machines to achieve maximum parallelism. For SET input tuples, the virtual machines are used per group if you specify a GROUP BY clause. Otherwise there will be only one group, which means only one node and virtual machine can process the data.

The query cache is not used for SELECT statements in UDF scripts. See also Query Cache.

Choosing a programming language for best performance

From a performance perspective, which programming language you should use in a UDF script depends on the purpose and context of the UDF, since specific elements may have different capacities in each language. For example, string processing can be faster in one language, while XML parsing can be faster in another. However, if overall performance is the most important criteria, we recommend using Lua. Lua is integrated natively in Exasol and therefore has the smallest process overhead.

Input and output values

In the CREATE SCRIPT command, you must define the type of input and output values. For example, you can create a script that generates multiple result rows out of a single input row (SCALAR ... EMITS).

Input values
SCALAR

Specifies that the script processes single input rows. The code is therefore called once per input row.

SET Specifies that the processing refers to a set of input values. Within the code, you can iterate through those values. For more information, see UDF Examples.
Output values
RETURNS Specifies that the script returns a single value.
EMITS

Specifies that the script can create (emit) multiple result rows (tuples).

If the input type is SET, the EMITS result can only be used alone, not combined with other expressions. However, you can nest it through a subselect to do further processing of those intermediate results.

You can define the data types of input and output parameters to specify the conversion between internal data types and the database SQL data types. If you do not specify the data types, the script has to handle that dynamically.

Each UDF script must contain the main function run(). This function is called with a parameter providing access to the input data of Exasol. If your script processes multiple input tuples (using SET), you can iterate through the single tuples using this parameter.

You can specify an ORDER BY clause either when creating a script (CREATE SCRIPT) or when calling it. This clause sorts the processing of the groups of SET input data. If it is necessary for the algorithm, you should specify this clause when creating the script to avoid wrong results due to misuse.

Input parameters in scripts are always case sensitive, similar to the script code. This is different to SQL identifiers, which are only case sensitive if they are delimited.

UDF Script Signature Generator

Exasol provides a web application that allows you to generate a template for a UDF script based on programming language, input type, script name, parameters, and output type. The resulting SQL contains the UDF signature, a language specific run method, and an example on how to iterate over and access the input, and how to produce output. This can then be used as a template and as a starting point for developing your UDFs.

UDF Script Signature Generator .