R

This article explains how to use the R language for UDF scripting.

The following describes using R 3.4.4. For additional information about R, refer to the official R Documentation.

run() and cleanup() Methods

The method run() is called for each input tuple (SCALAR) or each group (SET). Its parameter is a type of execution context and provides access to the data and the iterator in case of a SET script.

To initialize expensive steps (for example, opening external connections), you can write code outside the run() method, since this code is executed once at the beginning by each virtual machine. For deinitialization purpose, the method cleanup() is available. The method cleanup() is called once for each virtual machine, at the end of the execution.

Parameters

The internal R data types and the database SQL types are not identical. Therefore, casts must be done for the input and output data:

SQL data type R data type
DECIMAL(p,0) for p≤9 integer
DECIMAL(p,s) and DOUBLE double
DATE and TIMESTAMP POSIXt
BOOLEAN logical
VARCHAR and CHAR character
Other Not supported

The value NA is the equivalent of the SQL NULL.

For better performance, you should prefer DOUBLE to DECIMAL(p,s) for the parameter types.

The input parameters can be addressed by their names, for example, ctx$my_input. ctx refers to the name of the context parameter that is passed to the run() method.

However, you can also use a dynamic number of parameters through the notation (...), for example, CREATE R SCALAR SCRIPT my_script (...). The parameters can then be accessed through an index. In this case a special notation is required, for example, ctx[[1]]() for the first parameter. The number of parameters and their data types (both determined during the call of the script) are part of the metadata.

Metadata

You can access the following metadata through global variables:

Metadata Description

exa$meta$database_name

Database name

exa$meta$database_version

Database version

exa$meta$script_language

Name and version of the script language

exa$meta$script_name

Name of the script

exa$meta$script_schema

Schema in which the script is stored

exa$meta$current_schema

Schema which is currently opened

exa$meta$script_code

Code of the script

exa$meta$session_id

Session ID

exa$meta$statement_id

Statement ID within the session

exa$meta$current_user

Current user

exa$meta$scope_user

Scope user (current_user or the owner of a view if the udf script is called within a view)

exa$meta$node_count

Number of cluster nodes

exa$meta$node_id

Local node ID starting with 0

exa$meta$vm_id

Unique ID for the local machine (the IDs of the virtual machines have no relation to each other)

exa$meta$input_type

Type of the input data (SCALAR or SET)

exa$meta$input_column_count

Number of input columns

exa$meta$input_columns[]

Array including the following information: {name, type, sql_type, precision, scale, length}

exa$meta$output_type

Type of the output data (RETURNS or EMITS)

exa$meta$output_column_count

Number of output columns

exa$meta$output_columns[]

Array including the following information: {name, type, sql_type, precision, scale, length}

Data iterator

For scripts having multiple input tuples per call (keyword SET) there are two ways to access the data of your group:

  • Iterate over the data, one record at a time.
  • Load all (or multiple) rows for the group into a vector in memory to run R operations on it.

To iterate over the data, use the method next_row() which is accessible through the context. The function differs from the other languages, because next is a reserved keyword in R.

Initially, the iterator points to the first input row. That's why a repeat...until loop can be ideally used to iterate through the data (see examples). If the input data is empty, then the run() method will not be called, and similar to aggregate functions the NULL value is returned as result (for example, SELECT MAX(x) FROM t WHERE false).

To access the data as a vector, you can call the method next_row with a parameter that specifies how many records you want to read into the vector (for example, next_row(1000) or next_row(NA)). If you specify NA, all records of the group will be read into a single vector.

The vector will be held completely in memory. For very large groups this might exceed the memory limits, and you should fetch, for example, 1000 rows at a time until you processed all records of the group.

To get the actual vector with data, you can access the context property with the name of your input parameter (for example, ctx$input_word, see the example below). You have to call the next_row function first in this case, before accessing the data.

Additionally, there is a method reset() which resets the iterator to the first input element. Therefore, you can do multiple iterations through the data, if this is required for your algorithm. The method size() returns the number of input values.

emit()

You can return multiple output tuples per call (keyword EMITS) through the method emit().

The method expects as many parameters as output columns were defined. In the case of dynamic output parameters, it is useful to use a list and the do.call method like in the example, do.call(ctx$emit, list(1,"a")).

For scripts of type RETURNS, vectors can be used to improve the performance (see examples below).

Import other scripts

Other scripts can be imported through the method exa$import_script(). The return value of this method must be assigned to a variable, representing the imported module.

Syntax
<alias> <- exa$import_script('<schema>.<script>')
Examples
CREATE SCHEMA IF NOT EXISTS TEST;
--/
CREATE OR REPLACE R SCALAR SCRIPT TEST.R_DEMO() RETURNS VARCHAR(2000) AS

run <- function(ctx) {
 "Minimal R language UDF"
}
/

select TEST.R_DEMO();

CREATE SCHEMA IF NOT EXISTS LIB;

--/
CREATE OR REPLACE R SCALAR SCRIPT LIB.R_LIB() RETURNS INT AS
helloWorld <- function(){
  "R language library used in a UDF"
}
/

CREATE SCHEMA IF NOT EXISTS TEST;

--/
CREATE OR REPLACE R SCALAR SCRIPT TEST.R_DEMO2() RETURNS VARCHAR(2000) AS L <- exa$import_script('LIB.R_LIB')

run <- function(ctx) {
 L$helloWorld()
}
/

select TEST.R_DEMO2();

Access connection definitions

The data that has been specified when defining connections with CREATE CONNECTION, is available in R UDF scripts through the method exa$get_connection("<connection_name>"). The result is an R list with the following entries:

Field Description

type

The type of the connection definition. Currently only the type PASSWORD is used.

address

The part of the connection definition that followed the TO keyword in the CREATE CONNECTION command.

user

The part of the connection definition that followed the USER keyword in the CREATE CONNECTION command.

password

The part of the connection definition that followed the IDENTIFIED BY keyword in the CREATE CONNECTION command.

Auxiliary libraries

The following libraries are provided that are not already part of the language:

Library Description

bitops

Bitwise operations on integer vectors. For details, see http://cran.r-project. org/web/packages/bitops.

class

Various functions for classification For details, see http://cran.r-project. org/web/packages/class.

data.table

Extension of data.frame for fast indexing, fast ordered joins, fast assignment, fast grouping, and list columns. For details, see http://cran.r-project. org/web/packages/data.table.

e1071

Miscellaneous functions of the Department of Statistics (e1071) of the university TU Wien. For details, see http://cran.r-project.org/web/packages/e1071.

flashClust

Implementation of optimal hierarchical clustering. For details, see https://cran.r-project.org/web/packages/flashClust.

gbm

Generalized Boosted Regression Models. For details, see http://cran.r-project.org/web/packages/gbm.

Hmisc

Harrell Miscellaneous. For details, see http://cran.r-project.org/web/packages/ Hmisc)

MASS

Functions and data sets for Venables and Ripley's MASS. For details, see http:// cran.r-project.org/web/packages/MASS.

randomForest

Classification and regression algorithms. For details, see http://cran.r-project. org/web/packages/randomForest.

RCurl

Support for internet access like HTTP connections. For details, see https://cran.r-project.org/web/packages/RCurl.

RODBC

Interface for ODBC databases. For details, see http://cran.r-project. org/web/packages/RODBC.

rpart

Recursive partitioning and regression trees. For details, see http://cran.r-project. org/web/packages/rpart.

rredis

Interface for Redis. For details, see http://cran.r-project.org/web/packages/rredis.

RSXML

XML parsing. For details, see https://cran.r-project.org/web/packages/XML.

survival

Survival Analysis. For details, see http://cran.r-project.org/web/packages/survival.

Dynamic output parameters callback function

If the UDF script was defined with dynamic output parameters and the output parameters can not be determined (by specifying EMITS in the query or by INSERT INTO SELECT), the database calls the method defaultOutputColumns() which you can implement. The expected return value is a String with the names and types of the output columns, "a int, b varchar(100)". For details about when the method is called, see Dynamic Input and Output Parameters.

You can access the Metadata exa$meta in the method to find out the number and types of input columns.

The method will be executed only once on a single node.

User defined import callback function

To support a user defined import you can implement the callback method generate_sql_for_import_spec(import_spec). For details about the syntax, see Dynamic input and output parameters and IMPORT. The parameter import_spec contains all information about the executed IMPORT FROM SCRIPT statement. The function has to generate and return a SELECT statement which will retrieve the data to be imported.

import_spec has the following fields:

Fields Description

parameters

Parameters specified in the IMPORT statement. For example, parameters$FOO returns the value of parameter FOO.

is_subselect

This is true, if the IMPORT is used inside a SELECT statement and not inside an IMPORT INTO table statement.

subselect_column_names[]

If is_subselect is true and the user specified the target column names and types, this returns the names of all specified columns.

subselect_column_types[]

If is_subselect is true and the user specified the target column names and types, this returns the types of all specified columns. The types are returned in SQL format ("VARCHAR(100)").

connection_name

The name of the connection, if it's specified. The UDF script can then obtain the connection information through exa$get_connection(name). Otherwise it returns NA.

connection

If the user provided connection information this returns an object with connection information similar to the return value of exa$getConnection(name). Otherwise it returns NA.

The password is transferred into plaintext and could be visible in the logs. Therefore, It is recommended to create a CONNECTION before and to specify only the connection name (can be obtained from connection_name field). The actual connection information can be obtained through exa$get_connection(name).

User defined export callback function

To support a user defined export you can implement the callback method generate_sql_for_export_spec(export_spec). For syntax, see Dynamic input and output parameters and EXPORT statement. The parameter export_spec contains all information about the executed EXPORT INTO SCRIPT statement. The function has to generate and return a SELECT statement which will generate the data to be exported. The FROM part of that string can be a dummy table (DUAL) since the export command is aware which table should be exported. But it has to be specified to be able to compile the SQL string successfully.

export_spec has the following fields:

Fields Description

parameters

Parameters specified in the EXPORT statement. For example, parameters$FOO returns the value of parameter FOO.

source_column_names[]

List of column names of the resulting table that should be exported

has_truncate

Boolean value from the EXPORT command option that defines whether the content of the target table should be truncated before the data transfer.

has_replace

Boolean value from the EXPORT command option that defines whether the target table should be deleted before the data transfer.

created_by

String value from the EXPORT command option that defines the creation text executed in the target system before the data transfer.

connection_name

The name of the connection, if it's specified. The UDF script can then obtain the connection information through exa$get_connection(name). Otherwise it returns NA.

connection

If the user provided connection information this returns an object with connection information similar to return value of exa$get_Connection(name). Otherwise it returns NA.

The password is transferred into plaintext and could be visible in the logs. Therefore, It is recommended to create a CONNECTION before and to specify only the connection name (can be obtained from connection_name field). The actual connection information can be obtained through exa$get_connection(name).

Examples
-- A simple SCALAR RETURNS function, similar to the SQL function UPPER()
--/
CREATE R SCALAR SCRIPT r_upper(input_word VARCHAR(50))
RETURNS VARCHAR(50) AS
   run <- function(ctx) {
   # the last statement in a function is automatically used as return value
   toupper(ctx$input_word)
}
/
 
SELECT last_name, r_upper(last_name) AS r_upper FROM customer;
 

-- The same function as r_upper, but using a faster vector operation
--/
CREATE R SCALAR SCRIPT r_upper_vectorized(input_word VARCHAR(50))
RETURNS VARCHAR(50) AS
   run <- function(ctx) {
   # read all records in the current buffer into a single vector.
   # the buffer holds any number of records from the table
   ctx$next_row(NA) 
   toupper(ctx$input_word)
}
/
-- A simple SET-EMITS function, computing simple statistics on a column
--/
CREATE R SET SCRIPT r_stats(group_id INT, input_number DOUBLE)
EMITS (group_id INT, mean DOUBLE, stddev DOUBLE) AS
   run <- function(ctx) {
   # fetch all records from this group into a single vector
   ctx$next_row(NA)
   ctx$emit(ctx$group_id[1], mean(ctx$input_number), sd(ctx$input_number))
}
/

SELECT r_stats(groupId, num) FROM keyvalues
GROUP BY groupId ORDER BY group_id;

/*
  The following example loads from a webserver
  and processes the following file goalies.xml:

  <?xml version='1.0' encoding='UTF-8'?>
  <users>
    <user active="1">
      <first_name>Manuel</first_name>
      <last_name>Neuer</last_name>
    </user>
    <user active="1">
      <first_name>Joe</first_name>
      <last_name>Hart</last_name>
    </user>
    <user active="0">
      <first_name>Oliver</first_name>
      <last_name>Kahn</last_name>
    </user>
  </users>
*/
 
--/
CREATE R SCALAR SCRIPT process_users(url VARCHAR(500))
EMITS (firstname VARCHAR(100), lastname VARCHAR(100)) AS
   require('RCurl')
   require('XML')
  
   run <- function(ctx) {
   cont <- getURL(ctx$url)
   tree <- xmlTreeParse(cont)
   for (i in 1:length(tree$doc$children$users)) {
    if (tree$doc$children$users[i]$user$attributes['active']==1) {
       firstname <- tree$doc$children$users[i]$user$children$first_name
       lastname <- tree$doc$children$users[i]$user$children$last_name
       ctx$emit(firstname$children$text$value,
        lastname$children$text$value)
    }
  }
}
/
 
SELECT process_users ('http://www.my_valid_webserver/goalies.xml');