Using Python 3 in UDFs

Learn how to use Python 3 in user defined functions (UDFs).

Python 2 has reached end of life and is no longer supported. We recommend that you use Python 3 in all Python UDFs.

To learn more about the Python programming language, refer to the official Python Documentation.

Parameters

The internal Python data types are not identical to SQL data types, therefore casts must be done for the input and output data.

SQL data type	Python 3 data type
`DECIMAL(p,0)`	`int`
`DECIMAL(p,s)`	`decimal.Decimal`
`DOUBLE`	`float`
`DATE`	`datetime.date`
`TIMESTAMP`	`datetime.datetime`
`BOOLEAN`	`bool`
`VARCHAR` and `CHAR`	`str`

Usage notes

The input parameters can be addressed by their names, for example, ctx.my_input. In this case, ctx refers to the name of the context parameter that is passed to the run() method.

You can also use a dynamic number of parameters by using the notation (...). The parameters can then be accessed through an index where ctx[0] is the first parameter. The number of parameters and their data types will be part of the metadata and determined when the script is called. For example: CREATE PYTHON3 SCALAR SCRIPT my_script (...).
For better performance, use DOUBLE instead of DECIMAL.
The data type datetime.datetime only supports six fractional digits, which means that TIMESTAMP input/output values are truncated to six fractional digits even if TIMESTAMP(7), TIMESTAMP(8), or TIMESTAMP(9) are used.
The value None is the equivalent of the SQL NULL.

Metadata

You can access the following metadata through global variables:

Metadata	Description
`exa.meta.current_schema`	Schema which is currently opened
`exa.meta.current_user`	Current user
`exa.meta.database_name`	Database name
`exa.meta.database_version`	Database version
`exa.meta.input_column_count`	Number of input columns
`exa.meta.input_columns[]`	Array including the following information: {name, type, sql_type, size, precision, scale}. For more details, see Input/output columns.
`exa.meta.input_type`	Type of the input data (SCALAR or SET)
`exa.meta.memory.limit`	Memory limit of the running UDF process
`exa.meta.node_count`	Number of cluster nodes
`exa.meta.node_id`	Local node ID starting with 0
`exa.meta.output_column_count`	Number of output columns
`exa.meta.output_columns[]`	Array including the following information: {name, type, sql_type, size, precision, scale}. For more details, see Input/output columns.
`exa.meta.output_type`	Type of the output data (RETURNS or EMITS)
`exa.meta.scope_user`	Scope user (`current_user` or the owner of a view if the udf script is called within a view)
`exa.meta.script_code`	Code of the script
`exa.meta.script_language`	Name and version of the script language
`exa.meta.script_name`	Name of the script
`exa.meta.script_schema`	Schema in which the script is stored
`exa.meta.session_id`	Session ID
`exa.meta.statement_id`	Statement ID within the session
`exa.meta.vm_id`	Unique ID for the local machine (the IDs of the virtual machines have no relation to each other)

Input/output columns

This table describes the columns in exa.meta.input_columns[] and exa.meta.output_columns[].

Column	Description	Python data type
name	Name of column as UTF-8 string (required)	string
type	Column type (optional) - see the following table	(enum)
sql_type	Type name in Exasol (required)	string
size	Size for `CHAR` and `VARCHAR` types	int
precision	Precision for `DECIMAL` types	int
scale	Scale for `DECIMAL` types	int

Enumerated types in the type column

Column type	SQL data type
0	(unsupported)
1	`FLOAT`
2	`DECIMAL(4, 0)`
3	`DECIMAL(8, 0)`
4	All other numeric types
5	`TIMESTAMP`
6	`DATE`
7	`CHAR` or `VARCHAR`
8	`BOOL`

Callbacks

Dynamic output parameters callback

If a UDF script is defined with dynamic output parameters and the output parameters cannot be determined by specifying EMITS in the query or by INSERT INTO SELECT, the database calls the function default_output_columns() which you can implement. The expected return value is a string with the names and types of the output columns. For example: "a int, b varchar(100)".

The callback will be executed only once on a single node.

For more details and examples, see Dynamic input and output parameters

To find out the number and types of input columns, you can access the metadata exa.meta in the method.

User defined import callback

To support a user defined import you can implement the callback function generate_sql_for_import_spec(import_spec).

The parameter import_spec contains all information about the executed IMPORT FROM SCRIPT statement. The function has to generate and return a SELECT statement that will retrieve the data to be imported. The import_spec parameter has the following fields:

Field	Description
`parameters[]`	Parameters specified in the `IMPORT` statement. For example, `parameters['FOO']` returns either the value of parameter `FOO`, or `None`, if FOO was not specified.
`is_subselect`	This is true if `IMPORT` is used inside a `SELECT` statement and not inside an `IMPORT INTO` table statement.
`subselect_column_names[]`	Returns the names of all specified columns if `is_subselect` is true and the user specified the target column names and types.
`subselect_column_types[]`	Returns the names of all specified columns if `is_subselect` is true and the user specified the target column names and types. The types are returned in SQL format (`"VARCHAR(100)"`).
`connection_name`	Returns the name of the connection if it was specified, otherwise it returns `None`. The UDF script can then obtain the connection information through `exa.get_connection(name)`.
`connection`	This is only defined if the user provided connection information. It returns an object similar to the return value of `exa.getConnection(name)`. Returns `None` if no connection information is specified.

The password is transferred as cleartext and can be visible in the logs. For security reasons, we recommend that you always use CREATE CONNECTION to create a connection object to store connection strings and credentials, and in the script only specify the the name of the connection, which can be obtained from the connection_name field. The actual connection information can then be obtained through exa.get_connection(name).

User defined export callback

To support a user defined export you can implement the callback function generate_sql_for_export_spec(export_spec).

The parameter export_spec contains all information about the executed EXPORT INTO SCRIPT statement. The function has to generate and return a SELECT statement that will generate the data to be exported. The FROM part of that string can be a dummy table (DUAL), since the export command is aware which table should be exported. However, FROM must be specified to be able to compile the SQL string.

The export_spec parameter has the following fields:

Field	Description
`parameters[]`	Parameters specified in the `EXPORT` statement. For example, `parameters['FOO']` returns either the value of parameter `FOO`, or `None`, if `FOO` was not specified.
`source_column_names[]`	List of column names of the resulting table that should be exported.
`has_truncate`	Boolean value from the `EXPORT` command option that defines whether the content of the target table should be truncated before the data transfer.
`has_replace`	Boolean value from the `EXPORT` command option that defines whether the target table should be deleted before the data transfer.
`created_by`	String value from the `EXPORT` command option that defines the creation text executed in the target system before the data transfer.
`connection_name`	Returns the name of the connection if it was specified, otherwise it returns `None`. The UDF script can then obtain the connection information through `exa.get_connection(name)`.
`connection`	This is only defined if the user provided connection information. It returns an object similar to the return value of `exa.getConnection(name)`. Returns `None` if no connection information is specified.

Example:

Copy

/*
This example loads from a webserver
and processes the following file goalies.xml:
<?xml version='1.0' encoding='UTF-8'?>
<users>
    <user active="1">
        <first_name>Manuel</first_name>
        <last_name>Neuer</last_name>
    </user>
    <user active="1">
        <first_name>Joe</first_name>
        <last_name>Hart</last_name>
    </user>
    <user active="0">
        <first_name>Oliver</first_name>
        <last_name>Kahn</last_name>
    </user>
</users>
*/
-- /
CREATE PYTHON3 SCALAR SCRIPT process_users(url VARCHAR(500))
EMITS(firstname VARCHAR(20), lastname VARCHAR(20)) AS
import urllib.request
import lxml.etree as etree

 def run(ctx):
    data = b''.join(urllib.request.urlopen(ctx.url).readlines())
    tree = etree.XML(data)
    for user in tree.findall('user/[@active="1"]'):
        fn = user.findtext('first_name')
        ln = user.findtext('last_name')
        ctx.emit(fn, ln)
/
SELECT process_users ('http://www.my_valid_webserver/goalies.xml')
FROM DUAL;

Adapter script callback

For virtual schemas, an adapter script must define the function adapter_call(request_json). The parameter is a string in JSON containing the virtual schema API request. The return value must also be a string in JSON containing the response.

The callback will be executed only once on a single node.

For more information about the virtual schema API, see Information for Developers.

Preprocessor script callback

For SQL preprocessing, a preprocessor script must define the function adapter_call(sql_statement). The parameter is a string containing the original SQL statement. The return value must be a string containing the preprocessed SQL statement.

The callback will be executed only once on a single node.

To learn more about the SQL preprocessor, see SQL preprocessor.

Callback functions run() and cleanup()

The run() callback function is called for each input tuple (SCALAR) or each group (SET). Its parameter is a type of execution context that provides access to the data and the iterator in case of a SET script.To initialize expensive steps (such as opening external connections) you can write code outside the run() function, since this code is executed once by each virtual machine at the start of execution.

The cleanup() function can be used for deinitialization purposes. It is called once for each virtual machine at the end of execution.

Method emit()

You can return multiple output tuples per call (keyword EMITS) by using the method emit(). This method expects as many parameters as output columns were defined. In the case of dynamic output parameters, you can use a list object that can be unpacked using *. For example: ctx.emit(*currentRow).

Data iterator

For scripts having multiple input tuples per call (keyword SET), you can iterate through that data using the method next(), which is accessible through the context. Initially, the iterator points to the first input row. For iterating you can use a while True loop which is aborted in case if not ctx.next().

If the input data is empty, then the run() method will not be called, and similar to aggregate functions the NULL value is returned as result (for example, SELECT MAX(x) FROM t WHERE false).

The method reset() resets the iterator to the first input element. This allows you to do multiple iterations through the data, if this is required for your algorithm.

The method size() returns the number of input values.

Access connection definitions

The data that has been specified when defining connections with CREATE CONNECTION is available in Python UDF scripts through the method exa.get_connection("<connection name>"). The result is a Python object with the following fields:

Field	Description
`type`	The type of the connection definition. `ConnectionType` is an enumeration that currently only contains the entry PASSWORD
`address`	The part of the connection definition that followed the `TO` keyword in the `CREATE CONNECTION` command
`user`	The part of the connection definition that followed the `USER` keyword in the `CREATE CONNECTION` command
`password`	The part of the connection definition that followed the `IDENTIFIED BY` keyword in the `CREATE CONNECTION` command

Import other scripts

You can import other scripts through the method exa.import_script(). The return value of this method must be assigned to a variable that represents the imported module.

Syntax

Copy

<alias> = exa.import_script('<schema>.<script>')

Examples

Copy

CREATE SCHEMA IF NOT EXISTS TEST;
--/
CREATE OR REPLACE PYTHON3 SCALAR SCRIPT TEST.PYTHON_DEMO() RETURNS VARCHAR(2000) AS
def run(ctx):
    return "Minimal Python UDF"
/
select TEST.PYTHON_DEMO();
CREATE SCHEMA IF NOT EXISTS LIB;
--/
CREATE OR REPLACE PYTHON3 SCALAR SCRIPT LIB.MYLIB() RETURNS INT AS
def helloWorld():
    return "Hello Python3 World!"
/
CREATE SCHEMA IF NOT EXISTS TEST;
--/
CREATE OR REPLACE PYTHON3 SCALAR SCRIPT TEST.MYHELLOWORLD() RETURNS VARCHAR(2000) AS
l = exa.import_script('LIB.MYLIB')
def run(ctx):
    return l.helloWorld()
/
select TEST.MYHELLOWORLD();

Auxiliary libraries

The following libraries that are not already part of the language are provided by Exasol. For more information about a library, refer to the documentation from the respective developer.

To get the full list of added packages, you can use the UDF python3_info in the file language_info.sql on GitHub.

Library	Description
`lxml`	XML processing
`NumPy`	Numeric calculations
`PyTables`	Hierarchical database package
`pytz`	Time zone functions
`redis`	Interface for Redis
`scikit-learn`	Machine Learning
`SciPy`	Scientific tools
`ujson`	UltraJSON is an ultra fast JSON encoder and decoder written in pure C with bindings for Python 2.5+ and 3.
`pyexasol`	Official Python driver for Exasol. For more information, see PyExasol.
`requests`	Standard for making HTTP requests in Python
`pycurl`	Can be used to fetch objects identified by a URL from a Python program
`boto3`	AWS SDK for Python
`ldap`	Python-ldap provides an object-oriented API to access LDAP directory servers from Python programs.
`roman`	Converts an integer to a roman numeral
`OpenSSL`	A python wrapper module around the OpenSSL library
`smbc`	Binding for Samba client library libsmbclient
`leveldb`	Binding for the key-value database LevelDB
`pyodbc`	Database API module for ODBC
`pandas`	Data structures and data-analysis tools for working with structured and time-series data
`pycparser`	Parser for the C language
`cffi`	C Foreign Functions Interface to interact with C code from Python
`protobuf`	Google's platform-neutral mechanism for serializing structured data
`pykickstart`	Library for reading and writing kickstart files
`martian`	Library for embedding configuration information in Python code

Activate Python 3 in Exasol <6.2

In a new installation of Exasol version 6.2 or later, Python 3 is activated by default. In a system that was installed with a version prior to 6.2 and then updated to a later version, Python 3 must be explicitly activated.

To check if Python 3 is active in your database, run the following query:

Copy

SELECT p.system_value 
FROM exa_parameters p
WHERE p.parameter_name in ('SCRIPT_LANGUAGES')
;

If the result contains PYTHON3=builtin_python3, Python 3 is active in your system and you do not have to make any changes. If the result does not contain this string, you must explicitly activate Python 3. To do this, append a space and PYTHON3=builtin_python3 to the value returned by the query, then use ALTER SYSTEM SET SCRIPT_LANGUAGES to update the parameter. For example:

Copy

ALTER SYSTEM SET SCRIPT_LANGUAGES = 'PYTHON=builtin_python R=builtin_r JAVA=builtin_java PYTHON3=builtin_python3';

We recommend that you test the changes using ALTER SESSION before making system-wide changes using ALTER SYSTEM.

Using Python 3 in UDFs

Parameters

Usage notes

Metadata

Input/output columns

Callbacks

Dynamic output parameters callback

User defined import callback

User defined export callback

Example:

Adapter script callback

Preprocessor script callback

Callback functions run() and cleanup()

Method emit()

Data iterator

Access connection definitions

Import other scripts

Syntax

Examples

Auxiliary libraries

Activate Python 3 in Exasol <6.2

PRODUCT

RESOURCES