Python 3

This article explains how to use Python 3 for UDF scripting in Exasol.

For additional information about Python, refer to the official Python Documentation.

run() and cleanup()

The method run() is called once for each input tuple (SCALAR) or once for each group (SET). Its parameter is an execution context object that provides access to the data and, for SET scripts, to the iterator.

To initialize expensive steps (such as opening external connections), you can write code outside the run() method; this code is executed once per virtual machine at the start of execution. For deinitialization, the method cleanup() is available. This method is called once per virtual machine at the end of execution.
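As a minimal sketch of this lifecycle in plain Python (EXPENSIVE_RESOURCE is a hypothetical placeholder; in Exasol, the module-level code runs once per virtual machine and cleanup() is invoked by the database, not by your code):

```python
# Module-level code: executed once per virtual machine, before the first run().
# EXPENSIVE_RESOURCE is a hypothetical stand-in for, e.g., an external connection.
EXPENSIVE_RESOURCE = {"initialized": True}

def run(ctx):
    # Called for each input tuple (SCALAR) or each group (SET).
    return EXPENSIVE_RESOURCE["initialized"]

def cleanup():
    # Called once per virtual machine, after the last run().
    EXPENSIVE_RESOURCE.clear()
```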

Parameters

The internal Python data types are not identical to the SQL data types. Therefore, input and output values are converted between the two type systems as follows:

SQL data type Python 3 data type
DECIMAL(p,0) int
DECIMAL(p,s) decimal.Decimal
DOUBLE float
DATE datetime.date
TIMESTAMP datetime.datetime
BOOLEAN bool
VARCHAR and CHAR str

Usage notes

  • The input parameters can be addressed by their names, for example, ctx.my_input. In this case, ctx refers to the name of the context parameter that is passed to the run() method.

    You can also use a dynamic number of parameters by using the notation (...). The parameters can then be accessed through an index where ctx[0] is the first parameter. The number of parameters and their data types will be part of the metadata and determined when the script is called. For example: CREATE PYTHON3 SCALAR SCRIPT my_script (...).

  • For better performance, use DOUBLE instead of DECIMAL.

  • The data type datetime.datetime only supports six fractional digits, which means that TIMESTAMP input/output values are truncated to six fractional digits even if TIMESTAMP(7), TIMESTAMP(8), or TIMESTAMP(9) are used.

  • The value None is the equivalent of the SQL NULL.
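The mappings and the NULL rule can be sketched outside the database with plain Python; MockContext below is a hypothetical stand-in for the execution context that Exasol normally passes to run():

```python
import decimal

class MockContext:
    """Hypothetical stand-in for Exasol's execution context (illustration only)."""
    def __init__(self, **params):
        self.__dict__.update(params)

def run(ctx):
    # A DECIMAL(p,s) input arrives as decimal.Decimal; SQL NULL arrives as None.
    if ctx.price is None:
        return None              # None is emitted as SQL NULL
    return float(ctx.price) * 2  # a DOUBLE output corresponds to Python float

print(run(MockContext(price=decimal.Decimal("1.50"))))  # 3.0
print(run(MockContext(price=None)))                     # None
```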

Metadata

You can access the following metadata through global variables:

Metadata Description
exa.meta.database_name Database name
exa.meta.database_version Database version
exa.meta.script_language Name and version of the script language
exa.meta.script_name Name of the script
exa.meta.script_schema Schema in which the script is stored
exa.meta.current_schema Schema which is currently opened
exa.meta.script_code Code of the script
exa.meta.session_id Session ID
exa.meta.statement_id Statement ID within the session
exa.meta.current_user Current user
exa.meta.scope_user Scope user (current_user, or the owner of a view if the UDF script is called within a view)
exa.meta.node_count Number of cluster nodes
exa.meta.node_id Local node ID starting with 0
exa.meta.vm_id Unique ID for the local machine (the IDs of the virtual machines have no relation to each other)
exa.meta.input_type Type of the input data (SCALAR or SET)
exa.meta.input_column_count Number of input columns
exa.meta.input_columns[] Array including the following information: {name, type, type_name, size, precision, scale}. For more details, see Input/output columns.
exa.meta.output_type Type of the output data (RETURNS or EMITS)
exa.meta.output_column_count Number of output columns
exa.meta.output_columns[] Array including the following information: {name, type, type_name, size, precision, scale}. For more details, see Input/output columns.

Input/output columns

This table describes the columns in exa.meta.input_columns[] and exa.meta.output_columns[].

Column Description Python data type
name Name of column as UTF-8 string (required) string
type Column type (optional) - see the following table (enum)
type_name Type name in Exasol (required) string
size Size for CHAR and VARCHAR types int
precision Precision for DECIMAL types int
scale Scale for DECIMAL types int

Enumerated types in the type column

Column type SQL data type

0 (unsupported)
1 FLOAT
2 DECIMAL(4, 0)
3 DECIMAL(8, 0)
4 All other numeric types
5 TIMESTAMP
6 DATE
7 CHAR or VARCHAR
8 BOOL

Data iterator

For scripts with multiple input tuples per call (keyword SET), you can iterate through the data using the method next(), which is accessible through the context. Initially, the iterator points to the first input row. A common pattern is a while True loop that breaks when ctx.next() returns False.

If the input data is empty, the run() method is not called, and, as with aggregate functions, NULL is returned as the result (for example, SELECT MAX(x) FROM t WHERE false).

The method reset() resets the iterator to the first input element. This allows you to do multiple iterations through the data, if this is required for your algorithm.

The method size() returns the number of input values.
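The iteration pattern can be illustrated outside the database with a mock context; MockSetContext below is a hypothetical stand-in that imitates next(), reset(), and size() for a single input column x:

```python
class MockSetContext:
    """Hypothetical stand-in for the SET execution context (illustration only)."""
    def __init__(self, rows):
        self._rows = rows
        self._pos = 0

    @property
    def x(self):
        # current value of input column x
        return self._rows[self._pos]

    def next(self):
        # advance the iterator; returns False after the last row
        if self._pos + 1 >= len(self._rows):
            return False
        self._pos += 1
        return True

    def reset(self):
        # move the iterator back to the first input row
        self._pos = 0

    def size(self):
        # number of input rows
        return len(self._rows)

def run(ctx):
    total = 0.0
    while True:              # the loop pattern described above
        total += ctx.x
        if not ctx.next():
            break
    return total

print(run(MockSetContext([1.0, 2.0, 3.0])))  # 6.0
```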

emit()

You can return multiple output tuples per call (keyword EMITS) using the method emit(). The method expects as many parameters as there are output columns. With dynamic output parameters, it is convenient in Python to pass a list object and unpack it with the * operator.

For example: ctx.emit(*currentRow).
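A small simulation of this pattern (MockEmitContext is a hypothetical stand-in that merely records the emitted tuples; in a real script, emit() is provided by Exasol's execution context):

```python
class MockEmitContext:
    """Hypothetical stand-in that records emitted rows (illustration only)."""
    def __init__(self):
        self.emitted = []

    def emit(self, *columns):
        # each call produces one output tuple
        self.emitted.append(columns)

def run(ctx):
    current_row = ["John", "Doe"]  # one value per output column
    ctx.emit(*current_row)         # unpack the list into separate arguments

ctx = MockEmitContext()
run(ctx)
print(ctx.emitted)  # [('John', 'Doe')]
```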

Import other scripts

You can import other scripts through the method exa.import_script(). The return value of this method must be assigned to a variable that represents the imported module.

Syntax
<alias> = exa.import_script('<schema>.<script>')
Examples
CREATE SCHEMA IF NOT EXISTS TEST;
--/
CREATE OR REPLACE PYTHON3 SCALAR SCRIPT TEST.PYTHON_DEMO() RETURNS VARCHAR(2000) AS
def run(ctx):
    return "Minimal Python UDF"
/
SELECT TEST.PYTHON_DEMO();

CREATE SCHEMA IF NOT EXISTS LIB;
--/
CREATE OR REPLACE PYTHON3 SCALAR SCRIPT LIB.MYLIB() RETURNS INT AS
def helloWorld():
    return "Hello Python3 World!"
/

CREATE SCHEMA IF NOT EXISTS TEST;
--/
CREATE OR REPLACE PYTHON3 SCALAR SCRIPT TEST.MYHELLOWORLD() RETURNS VARCHAR(2000) AS
l = exa.import_script('LIB.MYLIB')
def run(ctx):
    return l.helloWorld()
/
SELECT TEST.MYHELLOWORLD();

Access connection definitions

The data that has been specified when defining connections with CREATE CONNECTION is available in Python UDF scripts through the method exa.get_connection("<connection name>"). The result is a Python object with the following fields:

Field Description
type The type of the connection definition. ConnectionType is an enumeration that currently only contains the entry PASSWORD.
address The part of the connection definition that followed the TO keyword in the CREATE CONNECTION command.
user The part of the connection definition that followed the USER keyword in the CREATE CONNECTION command.
password The part of the connection definition that followed the IDENTIFIED BY keyword in the CREATE CONNECTION command.

Auxiliary libraries

The following libraries are provided in addition to the Python standard library:

Library Description
lxml XML processing. For more details, see http://pypi.python.org/pypi/lxml.
NumPy Numeric calculations. For details, see http://www.numpy.org.
PyTables Hierarchical database package. For details, see http://www.pytables.org. For this library, the required build package HDF5 is available at http://www.h5py.org.
pytz Time zone functions. For details, see http://pytz.sourceforge.net.
redis Interface for Redis. For details, see http://pypi.python.org/pypi/redis/.
scikit-learn Machine learning. For details, see http://scikit-learn.org.
SciPy Scientific tools. For details, see http://www.scipy.org. For this library, the required build tool atlas is available at http://pypi.python.org/pypi/atlas.
ujson UltraJSON is an ultra fast JSON encoder and decoder written in pure C with bindings for Python 2.5+ and 3. For more details, see https://pypi.org/project/ujson/.
pyexasol Official Python driver for Exasol. For more details, see https://github.com/exasol/pyexasol.
requests Standard for making HTTP requests in Python. For more details, see https://pypi.org/project/requests/.
pycurl Can be used to fetch objects identified by a URL from a Python program. For more details, see https://pypi.org/project/pycurl/.
boto3 Boto3 is the AWS SDK for Python. For more details, see https://pypi.org/project/boto3/.
boto Boto is the older AWS SDK for Python, which was succeeded by Boto3. This package is deprecated and not recommended. For more details, see https://pypi.org/project/boto/.
ldap Python-ldap provides an object-oriented API to access LDAP directory servers from Python programs. For more details, see https://www.python-ldap.org/en/latest/.
roman Converts an integer to a Roman numeral. For more details, see https://pypi.org/project/roman/.
OpenSSL A Python wrapper module around the OpenSSL library. For more details, see https://www.pyopenssl.org/en/stable/.
smbc Binding for the Samba client library libsmbclient. For more details, see https://pypi.org/project/pysmbc/.
leveldb Binding for the key-value database LevelDB. For more details, see https://code.google.com/archive/p/py-leveldb/.
pyodbc Database API module for ODBC. For more details, see https://github.com/mkleehammer/pyodbc/wiki.
pandas Data structures and data-analysis tools for working with structured and time-series data. For more details, see https://pandas.pydata.org.
pycparser Parser for the C language. For more details, see https://github.com/eliben/pycparser.
cffi C Foreign Function Interface to interact with C code from Python. For more details, see https://cffi.readthedocs.io/en/latest/.
protobuf Google's platform-neutral mechanism for serializing structured data. For more details, see https://developers.google.com/protocol-buffers/.
pykickstart Library for reading and writing kickstart files. For more details, see https://pykickstart.readthedocs.io/en/latest/.
martian Library for embedding configuration information in Python code. For more details, see https://pypi.org/project/martian/.

Dynamic output parameters callback

If the UDF script is defined with dynamic output parameters and the output parameters cannot be determined (by specifying EMITS in the query or by INSERT INTO SELECT), the database calls the method default_output_columns(), which you can implement. The expected return value is a string with the names and types of the output columns, for example "a int, b varchar(100)".

For more details about when this method is called, see Dynamic input and output parameters.

To find out the number and types of input columns, you can access the metadata exa.meta in the method.

The method will be executed only once on a single node.
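A minimal sketch of such a callback; the mock exa object and the choice of VARCHAR(2000) for every column are assumptions for illustration (in a real script, the global exa object is provided by the database):

```python
class _Column:
    """Hypothetical stand-in for one entry of exa.meta.input_columns."""
    def __init__(self, name):
        self.name = name

class _Meta:
    input_columns = [_Column("city"), _Column("population")]

class _Exa:
    """Hypothetical stand-in for the global exa object."""
    meta = _Meta()

exa = _Exa()

def default_output_columns():
    # derive one VARCHAR(2000) output column per input column
    return ", ".join("%s VARCHAR(2000)" % c.name for c in exa.meta.input_columns)

print(default_output_columns())  # city VARCHAR(2000), population VARCHAR(2000)
```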

User defined import callback

To support a user defined import, you can implement the callback method generate_sql_for_import_spec(import_spec). For syntax, see Dynamic input and output parameters and IMPORT statement. The parameter import_spec contains all information about the executed IMPORT FROM SCRIPT statement. The function has to generate and return a SELECT statement that will retrieve the data to be imported. import_spec has the following fields:

Field Description
parameters[] Parameters specified in the IMPORT statement. For example, parameters['FOO'] returns either the value of parameter FOO, or None if FOO was not specified.
is_subselect True if the IMPORT is used inside a SELECT statement and not inside an IMPORT INTO table statement.
subselect_column_names[] If is_subselect is true and the user specified the target column names and types, this returns the names of all specified columns.
subselect_column_types[] If is_subselect is true and the user specified the target column names and types, this returns the types of all specified columns. The types are returned in SQL format (for example, "VARCHAR(100)").
connection_name The name of the connection, if one was specified; otherwise None. The UDF script can then obtain the connection information through exa.get_connection(name).
connection Only defined if the user provided connection information. It returns an object similar to the return value of exa.get_connection(name). Returns None if no connection information is specified.

The password is transferred in plaintext and could be visible in the logs. Therefore, we recommend creating a CONNECTION beforehand and specifying only the connection name (which can be obtained from the connection_name field). The actual connection information can then be obtained through exa.get_connection(name).
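A sketch of such a callback under stated assumptions: MockImportSpec imitates the import_spec argument, and ETL.FETCH_DATA is a made-up UDF that would perform the actual loading; only the connection name is embedded in the generated SQL:

```python
class MockImportSpec:
    """Hypothetical stand-in for the import_spec argument (illustration only)."""
    parameters = {"FOO": "bar"}
    is_subselect = False
    connection_name = "MY_CONN"

def generate_sql_for_import_spec(import_spec):
    # embed only the connection name, never the credentials themselves
    conn = import_spec.connection_name
    foo = import_spec.parameters.get("FOO")
    return "SELECT ETL.FETCH_DATA('%s', '%s')" % (conn, foo)

print(generate_sql_for_import_spec(MockImportSpec()))
# SELECT ETL.FETCH_DATA('MY_CONN', 'bar')
```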

User defined export callback

To support a user defined export, you can implement the callback method generate_sql_for_export_spec(export_spec). For syntax, see Dynamic input and output parameters and EXPORT statement. The parameter export_spec contains all information about the executed EXPORT INTO SCRIPT statement. The function has to generate and return a SELECT statement that will produce the data to be exported. The FROM part of that string can be a dummy table (for example, DUAL), since the EXPORT command knows which table should be exported; however, it must be specified so that the SQL string compiles successfully.

export_spec has the following fields:

Field Description
parameters[] Parameters specified in the EXPORT statement. For example, parameters['FOO'] returns either the value of parameter FOO, or None if FOO was not specified.
source_column_names[] List of column names of the resulting table that should be exported.
has_truncate Boolean value from the EXPORT command option that defines whether the content of the target table should be truncated before the data transfer.
has_replace Boolean value from the EXPORT command option that defines whether the target table should be deleted before the data transfer.
created_by String value from the EXPORT command option that defines the creation text executed in the target system before the data transfer.
connection_name The name of the connection, if one was specified; otherwise None. The UDF script can then obtain the connection information through exa.get_connection(name).
connection Only defined if the user provided connection information. It returns an object similar to the return value of exa.get_connection(name). Returns None if no connection information is specified.

The password is transferred in plaintext and could be visible in the logs. We recommend that you create a CONNECTION and specify only the connection name (which can be obtained from the connection_name field). The actual connection information can then be obtained through exa.get_connection(name).
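A sketch of such a callback under stated assumptions: MockExportSpec imitates the export_spec argument, and ETL.WRITE_DATA is a made-up UDF that would perform the actual writing; note the dummy FROM DUAL part described above:

```python
class MockExportSpec:
    """Hypothetical stand-in for the export_spec argument (illustration only)."""
    parameters = {"TARGET": "remote_table"}
    source_column_names = ["ID", "NAME"]
    has_truncate = True
    connection_name = "MY_CONN"

def generate_sql_for_export_spec(export_spec):
    # FROM DUAL is a dummy; the EXPORT command itself knows the source table
    cols = ", ".join("'%s'" % c for c in export_spec.source_column_names)
    return ("SELECT ETL.WRITE_DATA('%s', %s) FROM DUAL"
            % (export_spec.connection_name, cols))

print(generate_sql_for_export_spec(MockExportSpec()))
# SELECT ETL.WRITE_DATA('MY_CONN', 'ID', 'NAME') FROM DUAL
```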

Example
/*
This example loads the file goalies.xml from a webserver and processes it:

<?xml version='1.0' encoding='UTF-8'?>
<users>
    <user active="1">
        <first_name>Manuel</first_name>
        <last_name>Neuer</last_name>
    </user>
    <user active="1">
        <first_name>Joe</first_name>
        <last_name>Hart</last_name>
    </user>
    <user active="0">
        <first_name>Oliver</first_name>
        <last_name>Kahn</last_name>
    </user>
</users>
*/
--/
CREATE PYTHON3 SCALAR SCRIPT process_users(url VARCHAR(500))
EMITS (firstname VARCHAR(20), lastname VARCHAR(20)) AS
import urllib.request
import lxml.etree as etree

def run(ctx):
    data = b''.join(urllib.request.urlopen(ctx.url).readlines())
    tree = etree.XML(data)
    # emit one output tuple for each active user
    for user in tree.findall('user/[@active="1"]'):
        fn = user.findtext('first_name')
        ln = user.findtext('last_name')
        ctx.emit(fn, ln)
/

Adapter script callback

For virtual schemas, an adapter script must define the function adapter_call(request_json). The parameter is a JSON string containing the Virtual Schema API request, and the return value must also be a JSON string containing the response. The callback function is executed only on a single node.

For the Virtual Schema API Documentation, see Information for Developers.
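A minimal sketch of the callback shape (the request and response fields shown here are illustrative; consult the Virtual Schema API documentation for the actual message formats):

```python
import json

def adapter_call(request_json):
    # parse the JSON request and return a JSON response string
    request = json.loads(request_json)
    if request.get("type") == "getCapabilities":
        # hypothetical minimal answer: no optional capabilities supported
        return json.dumps({"type": "getCapabilities", "capabilities": []})
    raise ValueError("unsupported request type: %r" % request.get("type"))

print(adapter_call('{"type": "getCapabilities"}'))
# {"type": "getCapabilities", "capabilities": []}
```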

Activate Python 3 in databases installed prior to version 6.2

In a new installation of Exasol version 6.2 or later, Python 3 is activated by default. In a system that was installed with a version prior to 6.2 and then updated to a later version, Python 3 must be explicitly activated.

To check if Python 3 is active in your database, run the following query:

SELECT p.system_value 
FROM exa_parameters p
WHERE p.parameter_name in ('SCRIPT_LANGUAGES')
;

If the result contains PYTHON3=builtin_python3, Python 3 is active in your system and you do not have to make any changes. If the result does not contain this string, you must explicitly activate Python 3. To do this, append a space and PYTHON3=builtin_python3 to the value returned by the query, then use ALTER SYSTEM SET SCRIPT_LANGUAGES to update the parameter. For example:

ALTER SYSTEM SET SCRIPT_LANGUAGES = 'PYTHON=builtin_python R=builtin_r JAVA=builtin_java PYTHON3=builtin_python3';

We recommend that you test the changes using ALTER SESSION before making system-wide changes using ALTER SYSTEM.