UDF Examples
This section explains the functionality of user defined scripts.
Scalar functions
User defined scalar functions (keyword SCALAR) are the simplest case of user defined scripts, returning one scalar result value (keyword RETURNS) or several result tuples (keyword SET) for each input value or tuple.
The scripts must implement a function run() in which the processing is done. This function is called during the execution and gets a kind of context as parameter, which is the actual interface between the script and the database.
In the following example, a script is defined that returns the maximum of two values. This is equivalent to the CASE expression CASE WHEN x>=y THEN x WHEN x<y THEN y ELSE NULL END.
--/
CREATE LUA SCALAR SCRIPT my_maximum (a DOUBLE, b DOUBLE)
RETURNS DOUBLE AS
function run(ctx)
if ctx.a == null or ctx.b == null
then return null
end
if ctx.a > ctx.b
then return ctx.a
else return ctx.b
end
end
/
SELECT x,y,my_maximum(x,y) FROM t;
Aggregate and analytic functions
UDF scripts are important if the script processes multiple rows at once. You can create any kind of aggregate or analytic function. By defining the keyword SET you specify that multiple input tuples are processed. Within the run() method, you can iterate through this data using the method next().
Additionally, scripts can either return a single scalar value (keyword RETURNS) or multiple result tuples (keyword EMITS).
The following example defines two scripts: the aggregate function my_average (simulates AVG) and the analytic function my_sum. The scripts create three values per input row: a sequential number, the current value, and the sum of the previous values. The second script processes the input data in sorted order using the ORDER BY clause.
--/
CREATE LUA SET SCRIPT my_average (a DOUBLE)
RETURNS DOUBLE AS
function run(ctx)
if ctx.size() == 0
then return null
else
local num_sum = 0
repeat
if ctx.a ~= null then
num_sum = num_sum + ctx.a
end
until not ctx.next()
return num_sum/ctx.size()
end
end
/
SELECT my_average(x) FROM t;
--/
CREATE LUA SET SCRIPT my_sum (a DOUBLE)
EMITS (count_values DOUBLE, val DOUBLE, sum_values DOUBLE) AS
function run(ctx)
local num_sum = 0
local num_count = 0
repeat
if ctx.a ~= null then
num_sum = num_sum + ctx.a
num_count = num_count + 1
ctx.emit(num_count,ctx.a,num_sum)
end
until not ctx.next()
end
/
SELECT my_sum(x ORDER BY x) FROM t;
Dynamic input and output parameters
You can use the syntax (...) within CREATE SCRIPT to create flexible scripts with dynamic parameters. You can use these scripts for any input data type, for example, a maximum function independent of the data type, and for multiple input columns. With an EMITS script you can make the number of output parameters and their type dynamic.
To access and evaluate dynamic input parameters in UDF scripts, extract the number of input parameters and their types from the metadata, then access each parameter value by its index. For example, in Python, the number of input parameters is stored in the variable exa.meta.input_column_count.
If the UDF script is defined with dynamic output parameters, the actual output parameters and their types are determined dynamically whenever the UDF is called. There are three possibilities:
- You can specify the output parameters directly in the query after the UDF call, using the EMITS keyword followed by the names and types the UDF should output in this specific call.
- If the UDF is used in the top level SELECT of an INSERT INTO SELECT statement, the columns of the target table are used as output parameters.
- If EMITS and INSERT INTO SELECT are not specified, the database tries to call the function default_output_columns() (the name varies, this example is for Python) which returns the output parameters dynamically, for example, based on the input parameters. This method can be implemented by the user. For more information see Details for Programming Languages.
-- Define a pretty simple sampling script where the last parameter defines
-- the percentage of samples to be emitted.
--/
CREATE PYTHON3 SCALAR SCRIPT sample_simple (...) EMITS (...) AS
from random import randint, seed
seed(1001)
def run(ctx):
percent = ctx[exa.meta.input_column_count-1]
if randint(0,100) <= percent:
currentRow = [ctx[i] for i in range(0, exa.meta.input_column_count-1)]
ctx.emit(*currentRow)
/
-- This is the same UDF, but output arguments are generated automatically
-- to avoid explicit EMITS definition in SELECT.
-- In default_output_columns(), a prefix 'c' is added to the column names
-- because the input columns are auto-generated numbers
--/
CREATE PYTHON3 SCALAR SCRIPT sample (...) EMITS (...) AS
from random import randint, seed
seed(1001)
def run(ctx):
percent = ctx[exa.meta.input_column_count-1]
if randint(0,100) <= percent:
currentRow = [ctx[i] for i in range(0, exa.meta.input_column_count-1)]
ctx.emit(*currentRow)
def default_output_columns():
output_args = list()
for i in range(0, exa.meta.input_column_count-1):
name = exa.meta.input_columns[i].name
type = exa.meta.input_columns[i].sql_type
output_args.append("c" + name + " " + type)
return str(", ".join(output_args))
/
Example table
ID | USER_NAME | PAGE_VISITS |
---|---|---|
1 | Alice | 12 |
2 | Bob | 4 |
3 | Pete | 0 |
4 | Hans | 101 |
5 | John | 32 |
6 | Peter | 21 |
7 | Graham | 21 |
8 | Steve | 4 |
9 | Bill | 64 |
10 | Caludia | 201 |
-- The first UDF requires to specify the output columns via EMITS.
-- Here, 20% of rows should be extracted randomly.
SELECT sample_simple(id, user_name, page_visits, 20)
EMITS (id INT, user_name VARCHAR(100), PAGE_VISITS int)
FROM people;
Example:
ID | USER_NAME | PAGE_VISITS |
---|---|---|
2 | Bob | 4 |
5 | John | 32 |
-- In case of INSERT INTO, the UDF uses the target types automatically
--/
CREATE TABLE people_sample LIKE people;
INSERT INTO people_sample
SELECT sample_simple(id, user_name, page_visits, 20) FROM people;
MapReduce programs
UDF scripts are flexible and allow you to implement various types of analysis. This section shows an example of a MapReduce program that calculates the frequency of single words within a text, which is a problem that a standard SQL cannot solve.
In the example, the script map_words extracts single words out of a text and emits them. This script is integrated in an SQL query without requiring an additional aggregation script (a typical Reduce step of MapReduce), because of the built-in SQL function COUNT. This reduces the implementation effort because built-in SQL functions are already available in Exasol. Additionally, this improves the performance since the SQL execution within the built-in functions is more native.
--/
CREATE LUA SCALAR SCRIPT map_words(w varchar(10000))
EMITS (words varchar(100)) AS
function run(ctx)
local word = ctx.w
if (word ~= null)
then
for i in unicode.utf8.gmatch(word,'([%w%p]+)')
do
ctx.emit(i)
end
end
end
/
SELECT words, COUNT(*) FROM
(SELECT map_words(l_comment) FROM tpc.lineitem)
GROUP BY words ORDER BY 2 desc LIMIT 10;
Example results:
Words | Count (*) |
---|---|
the | 1376964 |
slyly | 649761 |
regular | 619211 |
final | 542206 |
carefully | 535430 |
furiously | 534054 |
ironic | 519784 |
blithely | 450438 |
even | 425013 |
quickly | 354712 |
Access to external services
You can exchange data with external services in scripts. In the following example, a list of URLs stored in a table is processed. The script reads the corresponding documents from a web server and calculates the length of the document.
Each script language provides different libraries to connect to the internet.
--/
CREATE LUA SCALAR SCRIPT length_of_doc (url VARCHAR(50))
EMITS (url VARCHAR(50), doc_length DOUBLE) AS
HTTP = require("socket.http")
function run(ctx)
file = http.request(ctx.url)
if file == nil then error('Cannot open URL ' .. ctx.url) end
ctx.emit(ctx.url, unicode.utf8.len(file))
end
/
SELECT length_of_doc(url) FROM t;
Example results:
URL | DOC_LENGTH |
---|---|
http://en.wikipedia.org/wiki/Main_Page.htm | 59960 |
http://en.wikipedia.org/wiki/Exasol | 30007 |
User-defined ETL using UDFs
You an use the UDF scripts to implement flexible ETL processes by defining how to extract and convert data from external sources. For more information, see User-defined IMPORT/EXPORT Using UDFs.