Accessing JSON Data with UDFs

You can use the Exasol database to work with JSON data. Since UDFs are very efficient in Exasol, you can store JSON in VARCHAR columns without the need to pre-parse it. You can parse it on the fly through a UDF when accessing the data.
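
In practice this means raw JSON documents can simply live in a wide VARCHAR column. The following DDL is a minimal sketch; the schema and table names are illustrative:

CREATE TABLE MY_SCHEMA.RAW_JSON (
  JSON_STRING VARCHAR(2000000)
);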

This demo presents a generic Python UDF to access field values in JSON documents through path expressions.

This tutorial is designed to work with the Free Trial System. Make sure you have access to the public demo system; if you don't, sign up here to get access to the demo system hosted by Exasol so you can test the data and scripts in this tutorial.

The Dataset: Article Summary

For this tutorial, we use JSON data from the Cornell Newsroom Summarization Dataset. This open dataset contains articles and summaries from a range of news publications and comprises around 1.2 million rows. For more details about the dataset, refer to https://summari.es/.

Each JSON document in the dataset contains the following fields:

  • archive
  • date
  • density
  • coverage
  • compression
  • compression_bin
  • coverage_bin
  • density_bin

Procedure

For this demo, we will use a dataset on the publicly available Exasol demo system. If you haven't already, sign up for an account on this page in the Get a test account for ExaCloud section. The schema we will be working with is "JSON_DEMO"; it contains a table called "NEWSROOM_SUMMARY" that we will use for this demo.

Step 1: Connect to the Exasol Database

As a first step, we will create a connection to the system and get familiar with the table structures.

To connect to the system, you can use the SQL client of your choice. Exasol recommends the Pro version of DbVisualizer; an open-source alternative is DBeaver. Enter the credentials you received by email.

Step 2: Examine the Data

Before we start processing the JSON data with a UDF, let's have a look at the data to get a better understanding of what we are working with.

-- Examine the data:
-- How many rows does the table hold?
SELECT COUNT(*) FROM JSON_DEMO.NEWSROOM_SUMMARY;

-- What does the data look like?
SELECT *
FROM JSON_DEMO.NEWSROOM_SUMMARY
LIMIT 10;
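
To get a feeling for the size of the individual documents, you can also check the length of the JSON strings. This is a small optional sketch; the column name JSON_STRING is the one used in the queries later in this tutorial:

-- How large are the JSON documents?
SELECT MIN(LENGTH(JSON_STRING)), MAX(LENGTH(JSON_STRING))
FROM JSON_DEMO.NEWSROOM_SUMMARY;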

Step 3: JSON_TABLE UDF

The UDF we are going to use to parse our data is called JSON_TABLE. It is publicly available in the ExaToolbox. On the demo system, we have already created this UDF for you in the schema JSON_DEMO.

A call to this function has the following format:

SELECT JSON_TABLE(
 <json string or column>,
 <path expression>,
 <path expression>,
 ...
) EMITS (<column_name> <data_type>, <column_name> <data_type>, ...)

The first parameter contains the JSON document, either as a string literal or as a column reference; the remaining parameters are path expressions. The EMITS clause has to be used to define the output column names and their data types.
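
For instance, a call against a JSON string literal could look as follows. This is a minimal sketch; the literal document and the output column names are purely illustrative:

SELECT JSON_DEMO.JSON_TABLE(
  '{"title": "Exasol", "year": 2000}',
  '$.title',
  '$.year'
) EMITS (TITLE VARCHAR(100), PUB_YEAR INT);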

Step 4: Parse the Data with the UDF

The following path expression finds all the density values in the column JSON_STRING and returns them as a result.

SELECT JSON_DEMO.JSON_TABLE(JSON_STRING,'$.density')
EMITS (DENSITY VARCHAR(2000000)) FROM JSON_DEMO.NEWSROOM_SUMMARY;
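
Because density is a numeric field, you can also emit it with a numeric type right away. This is a small variant sketch; it assumes all density values in the column parse cleanly as numbers:

SELECT JSON_DEMO.JSON_TABLE(JSON_STRING, '$.density')
EMITS (DENSITY DOUBLE)
FROM JSON_DEMO.NEWSROOM_SUMMARY;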

In the above statement, we used a path expression starting with $. The following table provides an overview of how to use path expressions with this UDF:

Expression                              Result
$                                       The complete document
$.archive                               The archive field; any other field name works the same way
$.ParentFieldName.ChildFieldName        A field nested inside an object
$.ParentFieldName[*].ChildFieldName     All child field values inside an array of objects
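
To illustrate the nested variants, here is a sketch against a small JSON literal; the document and output column names are purely illustrative:

-- Access a field nested inside an object
SELECT JSON_DEMO.JSON_TABLE(
  '{"person": {"name": "Alice", "hobbies": ["reading", "hiking"]}}',
  '$.person.name'
) EMITS (NAME VARCHAR(100));

-- Access all elements of a nested array
SELECT JSON_DEMO.JSON_TABLE(
  '{"person": {"name": "Alice", "hobbies": ["reading", "hiking"]}}',
  '$.person.hobbies[*]'
) EMITS (HOBBY VARCHAR(100));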

In our dataset, the publication date is stored as an integer in the Internet Archive date format: YYYYMMDDHHMMSS. With the following statement we can alter the timestamp format for our current session so that these values can be cast to TIMESTAMP directly.

ALTER SESSION SET NLS_TIMESTAMP_FORMAT = 'YYYYMMDDHHMISS';
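
To verify the session setting, you can cast a sample value directly. This is a quick optional check; note that the result is also rendered using the session's timestamp format:

SELECT CAST('20200101120000' AS TIMESTAMP);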

Now we can write a query that parses the complete JSON table into separate column values.

SELECT JSON_DEMO.JSON_TABLE(JSON_STRING, '$.archive', '$.date', '$.compression', '$.coverage', '$.density', '$.compression_bin', '$.density_bin')
EMITS (Archive VARCHAR(2000000), PublicationDate TIMESTAMP, Compression DOUBLE, Coverage DOUBLE, Density DOUBLE, Compression_Bin VARCHAR(200), Density_Bin VARCHAR(200))
FROM JSON_DEMO.NEWSROOM_SUMMARY;

Details and limitations of the UDF:

  • When the JSON input is not a single document but an array, the elements of the array can be accessed with $[*] (see the example after this list).
  • It is recommended to use the correct data types in the EMITS clause. Otherwise, values are cast, which can lead to type-conversion errors.
  • When accessing non-atomic values, that is, arrays that are not unnested, or nested documents, they are returned as a VARCHAR containing the result in JSON format.
  • Accessing multi-dimensional arrays is not supported. This means at most one [*] can be used in a path expression.
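
The following sketch shows the $[*] expression against a JSON array literal; the array and the output column name are purely illustrative:

SELECT JSON_DEMO.JSON_TABLE(
  '[1, 2, 3]',
  '$[*]'
) EMITS (ELEMENT_VALUE DOUBLE);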

Step 5: Create a Table Out of JSON Data

Now that we have written a query to parse the JSON data, we can also create a table out of it. Creating this table is useful if we want to keep working with the data once we have found a format that we like. The following query creates a new table NEWSROOM_DATA in your own schema. Replace the schema name PUB2105 with your own schema name before running the following statements.

CREATE OR REPLACE TABLE PUB2105.NEWSROOM_DATA
AS
SELECT JSON_DEMO.JSON_TABLE(JSON_STRING, '$.archive', '$.date', '$.compression', '$.coverage', '$.density', '$.compression_bin', '$.density_bin')
EMITS (Archive VARCHAR(2000000), PublicationDate TIMESTAMP, Compression DOUBLE, Coverage DOUBLE, Density DOUBLE, Compression_Bin VARCHAR(200), Density_Bin VARCHAR(200))
FROM JSON_DEMO.NEWSROOM_SUMMARY;

SELECT * FROM PUB2105.NEWSROOM_DATA
LIMIT 10;
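
With the data in relational form, regular SQL analysis becomes straightforward. As a small example sketch, the following query aggregates the parsed columns (remember to replace PUB2105 with your own schema name):

SELECT COMPRESSION_BIN, COUNT(*), AVG(DENSITY)
FROM PUB2105.NEWSROOM_DATA
GROUP BY COMPRESSION_BIN;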

Step 6: Parse Your Own Data

Do you have your own JSON dataset that you would like to explore in Exasol? Feel free to load it into your own schema and play around with it.

If you want to load data from a URL, refer to the sample script below:

--Small script to load data from a URL and optionally split the data on newlines
--/
create or replace python scalar script load_data_from_http
(url varchar(500), split_on_newline boolean) emits (output_data varchar(2000000)) as
import urllib2  # the PYTHON language alias maps to Python 2 here

def run(ctx):
    # Fetch the document behind the given URL
    response = urllib2.urlopen(ctx.url)
    data = response.read()
    if ctx.split_on_newline:
        # Emit one row per line, skipping empty lines and lone carriage returns
        for line in data.split('\n'):
            if len(line) > 1:
                ctx.emit(line)
    else:
        # Emit the whole document as a single row
        ctx.emit(data)
/

This script will give you a VARCHAR(2000000) field called OUTPUT_DATA with the content of the file.
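
A call could look like this; the URL is a placeholder, and note that the script uses urllib2 and therefore assumes the PYTHON language alias maps to Python 2 on your system:

SELECT load_data_from_http('https://example.com/data.json', true);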

Working with JSON in Exasol - The ExaToolbox

The JSON_TABLE function shown in this example is a scalar EMITS function, but it is of course not the only way to work with JSON in Exasol. The ExaToolbox is a GitHub repository containing helpful functions to use in Exasol, and it includes a dedicated section on JSON data.

The repository, for example, contains the JSON_VALUE function, which is a scalar RETURNS function and therefore outputs exactly one result for every input row. Refer to the ExaToolbox for further possibilities on how to work with JSON and pick your favorite function.
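
A call to JSON_VALUE could look like the following sketch. This assumes the function follows the same (document, path) parameter pattern as JSON_TABLE and has been created in a schema you can access:

SELECT JSON_DEMO.JSON_VALUE(JSON_STRING, '$.density')
FROM JSON_DEMO.NEWSROOM_SUMMARY;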