UDF Examples

This section explains the functionality of user defined scripts.

Scalar functions

User defined scalar functions (keyword SCALAR) are the simplest case of user defined scripts, returning one scalar result value (keyword RETURNS) or several result tuples (keyword SET) for each input value or tuple.

The scripts must implement a function run() in which the processing is done. This function is called during the execution and gets a kind of context as parameter, which is the actual interface between the script and the database.

In the following example, a script is defined that returns the maximum of two values. This is equivalent to the CASE expression CASE WHEN x>=y THEN x WHEN x<y THEN y ELSE NULL END.

--/
CREATE LUA SCALAR SCRIPT my_maximum (a DOUBLE, b DOUBLE)
    RETURNS DOUBLE AS
function run(ctx)
 if ctx.a == null or ctx.b == null
    then return null
 end
 if ctx.a > ctx.b
    then return ctx.a
    else return ctx.b
 end
end

/
SELECT x,y,my_maximum(x,y) FROM t;

Aggregate and analytic functions

UDF scripts are important if the script processes multiple rows at once. You can create any kind of aggregate or analytic function. By defining the keyword SET you specify that multiple input tuples are processed. Within the run() method, you can iterate through this data using the method next().

Additionally, scripts can either return a single scalar value (keyword RETURNS) or multiple result tuples (keyword EMITS).

The following example defines two scripts: the aggregate function my_average (simulates AVG) and the analytic function my_sum. The scripts create three values per input row: a sequential number, the current value, and the sum of the previous values. The second script processes the input data in sorted order using the ORDER BY clause.

--/
CREATE LUA SET SCRIPT my_average (a DOUBLE)
    RETURNS DOUBLE AS
function run(ctx)
  if ctx.size() == 0
    then return null
  else
    local num_sum = 0
    repeat
      if ctx.a ~= null then
        num_sum = num_sum + ctx.a    
      end
    until not ctx.next()
    return num_sum/ctx.size() 
  end
end
/

SELECT my_average(x) FROM t;
--/
CREATE LUA SET SCRIPT my_sum (a DOUBLE)
    EMITS (count_values DOUBLE, val DOUBLE, sum_values DOUBLE) AS
function run(ctx)
  local num_sum = 0
  local num_count = 0
  repeat
    if ctx.a ~= null then
      num_sum = num_sum + ctx.a
      num_count = num_count + 1
      ctx.emit(num_count,ctx.a,num_sum)
       end
  until not ctx.next()
end
/

SELECT my_sum(x ORDER BY x) FROM t;

Dynamic input and output parameters

To create flexible scripts with dynamic parameters, you can use the syntax (...) within CREATE SCRIPT. You can use these scripts for any input data type, such as a maximum function independent of the data type, and for multiple input columns. You can make the number of output parameters and their type dynamic by using an EMITS script.

To access and evaluate dynamic input parameters in UDF scripts, extract the number of input parameters and their types from the metadata, then access each parameter value by its index. For example, in Python, the number of input parameters is stored in the variable exa.meta.input_column_count.

If the UDF script is defined with dynamic output parameters, the actual output parameters and their types are dynamically determined whenever the UDF is called. There are three possibilities:

  • You can specify the output parameters directly in the query after the UDF call, using the EMITS keyword followed by the names and types the UDF should output in this specific call.

  • If the UDF is used in the top level SELECT of an INSERT INTO SELECT statement, the columns of the target table are used as output parameters.

  • If EMITS and INSERT INTO SELECT are not specified, the database tries to call the function default_output_columns() (the name varies, this example is for Python) which returns the output parameters dynamically, for example, based on the input parameters. This method can be implemented by the user. For more information see Details for Programming Languages.

The default_output_columns() function is called before the run(ctx) function, since it provides input for the query execution plan. The default_output_columns() function can only access the exa object, which means that it can either access the metadata exa.meta, the connection object (CREATE CONNECTION) trough exa.get_connection(), or other scripts through exa.import_scripts(). The emit columns can therefore not be data dependent and must be known before.

Example 1:
-- Define a simple sampling script where the last parameter defines
-- the percentage of samples to be emitted.

--/
CREATE PYTHON3 SCALAR SCRIPT sample_simple (...) EMITS (...) AS
from random import randint, seed
seed(1001)
def run(ctx):
 percent = ctx[exa.meta.input_column_count-1]
 if randint(0,100) <= percent:
  currentRow = [ctx[i] for i in range(0, exa.meta.input_column_count-1)]
  ctx.emit(*currentRow)
/
Example 2:
-- This is the same UDF, but output arguments are automatically generated
-- to avoid explicit EMITS definition in SELECT.
-- In default_output_columns(), a prefix 'c' is added to the column names
-- because the input columns are auto-generated numbers

--/
CREATE PYTHON3 SCALAR SCRIPT sample (...) EMITS (...) AS
from random import randint, seed
seed(1001)
def run(ctx):
 percent = ctx[exa.meta.input_column_count-1]
 if randint(0,100) <= percent:
  currentRow = [ctx[i] for i in range(0, exa.meta.input_column_count-1)]
  ctx.emit(*currentRow)
def default_output_columns():
 output_args = list()
 for i in range(0, exa.meta.input_column_count-1):
  name = exa.meta.input_columns[i].name
  type = exa.meta.input_columns[i].sql_type
  output_args.append("c" + name + " " + type)
 return str(", ".join(output_args))
/
Example 3:
ID USER_NAME PAGE_VISITS
1 Alice 12
2 Bob 4
3 Pete 0
4 Hans 101
5 John 32
6 Peter 21
7 Graham 21
8 Steve 4
9 Bill 64
10 Caludia 201
-- The first UDF requires to specify the output columns via EMITS.
-- Here, 20% of rows should be extracted randomly.

SELECT sample_simple(id, user_name, page_visits, 20)
 EMITS (id INT, user_name VARCHAR(100), PAGE_VISITS int)
 FROM people;
ID USER_NAME PAGE_VISITS
2 Bob 4
5 John 32
-- In case of INSERT INTO, the UDF uses the target types automatically
--/
CREATE TABLE people_sample LIKE people;
INSERT INTO people_sample
 SELECT sample_simple(id, user_name, page_visits, 20) FROM people;

MapReduce programs

UDF scripts are flexible and allow you to implement various types of analysis. This section shows an example of a MapReduce program that calculates the frequency of single words within a text, which is a problem that a standard SQL cannot solve.

In the example, the script map_words extracts single words out of a text and emits them. This script is integrated in an SQL query without requiring an additional aggregation script (a typical Reduce step of MapReduce), because of the built-in SQL function COUNT. This reduces the implementation effort because built-in SQL functions are already available in Exasol. Additionally, this improves the performance since the SQL execution within the built-in functions is more native.

Example:
--/
CREATE LUA SCALAR SCRIPT map_words(w varchar(10000))
EMITS (words varchar(100)) AS
function run(ctx)
    local word = ctx.w
    if (word ~= null)
    then
      for i in unicode.utf8.gmatch(word,'([%w%p]+)')
       do
        ctx.emit(i)
      end
    end
end
/

SELECT words, COUNT(*) FROM
  (SELECT map_words(l_comment) FROM tpc.lineitem)
GROUP BY words ORDER BY 2 desc LIMIT 10;
Example results:
Words Count (*)
the 1376964
slyly 649761
regular 619211
final 542206
carefully 535430
furiously 534054
ironic 519784
blithely 450438
even 425013
quickly 354712

Access to external services

You can exchange data with external services in scripts. In the following example, a list of URLs stored in a table is processed. The script reads the corresponding documents from a web server and calculates the length of the document.

Each script language provides different libraries to connect to the internet.

Example:
--/
CREATE LUA SCALAR SCRIPT length_of_doc (url VARCHAR(50))
    EMITS (url VARCHAR(50), doc_length DOUBLE) AS
HTTP = require("socket.http")
function run(ctx)
    file = http.request(ctx.url)
    if file == nil then error('Cannot open URL ' .. ctx.url) end
    ctx.emit(ctx.url, unicode.utf8.len(file))
end
/
 
SELECT length_of_doc(url) FROM t;
Example results:
URL DOC_LENGTH
http://en.wikipedia.org/wiki/Main_Page.htm 59960
http://en.wikipedia.org/wiki/Exasol 30007

User-defined ETL using UDFs

You can use UDF scripts to implement flexible ETL processes by defining how to extract and convert data from external sources. For more information, see User-defined IMPORT/EXPORT Using UDFs.