UDF Examples
This article provides examples that illustrate the functionality of user-defined scripts (UDFs).
Scalar functions
User-defined scalar functions (keyword SCALAR) are the simplest case of user-defined scripts, returning one scalar result value (keyword RETURNS) or several result tuples (keyword EMITS) for each input value or tuple.
The scripts must implement a function run(), in which the processing is done. This function is called during execution and receives a context object as a parameter, which is the actual interface between the script and the database.
In the following example, a script is defined that returns the maximum of two values. This is equivalent to the CASE expression CASE WHEN x>=y THEN x WHEN x<y THEN y ELSE NULL END.
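A minimal Lua sketch of such a script could look like this (the script name my_maximum and the columns x and y of table t are assumptions for illustration):
--/
CREATE LUA SCALAR SCRIPT my_maximum (x DOUBLE, y DOUBLE)
RETURNS DOUBLE AS
function run(ctx)
    -- ctx provides access to the input parameters by name
    if ctx.x == null or ctx.y == null then
        return null
    elseif ctx.x >= ctx.y then
        return ctx.x
    else
        return ctx.y
    end
end
/

SELECT x, y, my_maximum(x, y) FROM t;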
Aggregate and analytic functions
UDF scripts are particularly useful when multiple rows are processed at once. You can create any kind of aggregate or analytic function. By defining the keyword SET, you specify that multiple input tuples are processed. Within the run() method, you can iterate through this data using the method next(). Additionally, scripts can either return a single scalar value (keyword RETURNS) or multiple result tuples (keyword EMITS).
The following example defines two scripts: the aggregate function my_average (which simulates AVG) and the analytic function my_sum. The second script emits three values per input row: a sequential number, the current value, and the running sum of the values so far. It processes the input data in sorted order because of the ORDER BY clause in the query.
--/
CREATE LUA SET SCRIPT my_average (a DOUBLE)
RETURNS DOUBLE AS
function run(ctx)
    if ctx.size() == 0 then
        return null
    else
        local num_sum = 0
        repeat
            if ctx.a ~= null then
                num_sum = num_sum + ctx.a
            end
        until not ctx.next()
        return num_sum/ctx.size()
    end
end
/

SELECT my_average(x) FROM t;
--/
CREATE LUA SET SCRIPT my_sum (a DOUBLE)
EMITS (count_values DOUBLE, val DOUBLE, sum_values DOUBLE) AS
function run(ctx)
    local num_sum = 0
    local num_count = 0
    repeat
        if ctx.a ~= null then
            num_sum = num_sum + ctx.a
            num_count = num_count + 1
            ctx.emit(num_count, ctx.a, num_sum)
        end
    until not ctx.next()
end
/

SELECT my_sum(x ORDER BY x) FROM t;
Dynamic input and output parameters
To create flexible scripts with dynamic parameters, you can use the syntax (...) within CREATE SCRIPT. Such scripts accept any input data type and any number of input columns, for example a maximum function that is independent of the data type. By defining an EMITS script with the syntax (...), the number of output parameters and their types can also be made dynamic.
To access and evaluate dynamic input parameters in UDF scripts, extract the number of input parameters and their types from the metadata, then access each parameter value by its index. For example, in Python, the number of input parameters is stored in the variable exa.meta.input_column_count.
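As a minimal sketch (the script name show_columns and its output columns are assumptions for illustration), the following Python script reads the number and types of its dynamic input parameters from the metadata and accesses each value by its index:
--/
CREATE PYTHON3 SCALAR SCRIPT show_columns (...)
EMITS (col_name VARCHAR(200), col_type VARCHAR(200), col_value VARCHAR(2000)) AS
def run(ctx):
    # exa.meta.input_column_count holds the number of dynamic input parameters
    for i in range(exa.meta.input_column_count):
        col = exa.meta.input_columns[i]
        # col.name and col.sql_type describe the parameter,
        # ctx[i] accesses its value by index
        ctx.emit(col.name, col.sql_type, str(ctx[i]))
/

SELECT show_columns(x, 2*x) FROM t;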
If the UDF script is defined with dynamic output parameters, the actual output parameters and their types are determined dynamically whenever the UDF is called. There are three possibilities:
- You can specify the output parameters directly in the query after the UDF call, using the EMITS keyword followed by the names and types the UDF should output in this specific call.
- If the UDF is used in the top-level SELECT of an INSERT INTO SELECT statement, the columns of the target table are used as output parameters.
- If neither EMITS nor INSERT INTO SELECT is used, the database tries to call the function default_output_columns() (the name varies by language; this example is for Python), which returns the output parameters dynamically, for example based on the input parameters. This method can be implemented by the user. For more information, see Details for Programming Languages.
The default_output_columns() function is called before the run(ctx) function, since it provides input for the query execution plan. The default_output_columns() function can only access the exa object, which means it can access the metadata exa.meta, connection objects (CREATE CONNECTION) through exa.get_connection(), or other scripts through exa.import_scripts(). The emitted columns therefore cannot depend on the actual data and must be known before the query is executed.
Example 1:
-- Define a simple sampling script where the last parameter defines
-- the percentage of samples to be emitted.
--/
CREATE PYTHON3 SCALAR SCRIPT sample_simple (...) EMITS (...) AS
from random import randint, seed
seed(1001)
def run(ctx):
    percent = ctx[exa.meta.input_column_count-1]
    if randint(0,100) <= percent:
        currentRow = [ctx[i] for i in range(0, exa.meta.input_column_count-1)]
        ctx.emit(*currentRow)
/
Example 2:
-- This is the same UDF, but output arguments are automatically generated
-- to avoid explicit EMITS definition in SELECT.
-- In default_output_columns(), a prefix 'c' is added to the column names
-- because the input columns are auto-generated numbers
--/
CREATE PYTHON3 SCALAR SCRIPT sample (...) EMITS (...) AS
from random import randint, seed
seed(1001)
def run(ctx):
    percent = ctx[exa.meta.input_column_count-1]
    if randint(0,100) <= percent:
        currentRow = [ctx[i] for i in range(0, exa.meta.input_column_count-1)]
        ctx.emit(*currentRow)

def default_output_columns():
    output_args = list()
    for i in range(0, exa.meta.input_column_count-1):
        name = exa.meta.input_columns[i].name
        type = exa.meta.input_columns[i].sql_type
        output_args.append("c" + name + " " + type)
    return str(", ".join(output_args))
/
Example 3:
The following queries operate on a table people with this content:
ID | USER_NAME | PAGE_VISITS |
---|---|---|
1 | Alice | 12 |
2 | Bob | 4 |
3 | Pete | 0 |
4 | Hans | 101 |
5 | John | 32 |
6 | Peter | 21 |
7 | Graham | 21 |
8 | Steve | 4 |
9 | Bill | 64 |
10 | Claudia | 201 |
-- The first UDF requires the output columns to be specified via EMITS.
-- Here, 20% of the rows are sampled randomly.
SELECT sample_simple(id, user_name, page_visits, 20)
EMITS (id INT, user_name VARCHAR(100), page_visits INT)
FROM people;
ID | USER_NAME | PAGE_VISITS |
---|---|---|
2 | Bob | 4 |
5 | John | 32 |
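For comparison, the other two possibilities described above can be sketched as follows (the target table people_sample is an assumption for illustration): the sample script from Example 2 derives its output columns via default_output_columns(), so it can be called without an EMITS clause, and in an INSERT INTO SELECT statement the columns of the target table are used.
-- No EMITS clause is needed here, because default_output_columns()
-- determines the output columns of the sample script.
SELECT sample(id, user_name, page_visits, 20) FROM people;

-- In INSERT INTO SELECT, the columns of the (hypothetical) target table
-- people_sample determine the output parameters.
INSERT INTO people_sample
SELECT sample_simple(id, user_name, page_visits, 20) FROM people;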
MapReduce programs
UDF scripts are flexible and allow you to implement various types of analysis. This section shows an example of a MapReduce program that calculates the frequency of single words within a text, a problem that standard SQL cannot solve.
In the example, the script map_words extracts single words from a text and emits them. The script is embedded in a SQL query that uses the built-in aggregate function COUNT, so no additional aggregation script (the typical Reduce step of MapReduce) is needed. This reduces the implementation effort, because the aggregation is already available as a built-in SQL function in Exasol, and it improves performance, since the built-in functions are executed natively by the database.
Example:
--/
CREATE LUA SCALAR SCRIPT map_words(w VARCHAR(10000))
EMITS (words VARCHAR(100)) AS
function run(ctx)
    local word = ctx.w
    if word ~= null then
        for i in unicode.utf8.gmatch(word, '([%w%p]+)') do
            ctx.emit(i)
        end
    end
end
/

SELECT words, COUNT(*) FROM
  (SELECT map_words(l_comment) FROM tpc.lineitem)
GROUP BY words ORDER BY 2 DESC LIMIT 10;
Example results:
Words | Count (*) |
---|---|
the | 1376964 |
slyly | 649761 |
regular | 619211 |
final | 542206 |
carefully | 535430 |
furiously | 534054 |
ironic | 519784 |
blithely | 450438 |
even | 425013 |
quickly | 354712 |
Access to external services
You can exchange data with external services in scripts. In the following example, a list of URLs stored in a table is processed. The script reads the corresponding documents from a web server and calculates the length of each document.
Each script language provides different libraries to connect to the internet.
Example:
--/
CREATE LUA SCALAR SCRIPT length_of_doc (url VARCHAR(50))
EMITS (url VARCHAR(50), doc_length DOUBLE) AS
http = require("socket.http")
function run(ctx)
    local file = http.request(ctx.url)
    if file == nil then error('Cannot open URL ' .. ctx.url) end
    ctx.emit(ctx.url, unicode.utf8.len(file))
end
/

SELECT length_of_doc(url) FROM t;
Example results:
URL | DOC_LENGTH |
---|---|
http://en.wikipedia.org/wiki/Main_Page.htm | 59960 |
http://en.wikipedia.org/wiki/Exasol | 30007 |
User-defined ETL using UDFs
You can use UDF scripts to implement flexible ETL processes by defining how to extract and convert data from external sources. For more information, see User-defined IMPORT/EXPORT Using UDFs.