In this tutorial you will learn how to do regression analysis in Exasol using a UDF written in R.
This tutorial uses database objects in the free public demo system hosted by Exasol. To get access to the public demo system, sign up here.
Introduction
In this tutorial you will learn how to perform a simple regression analysis in Exasol by displaying JSON data stored in a column as a table. You will create a user defined function (UDF) in the R programming language and use this on an already trained model with sample data.
The tutorial will demonstrate that there is no need to export data from Exasol to a different system to apply machine learning models. Everything can be done using UDFs directly within the Exasol database where the data is stored.
To follow the tutorial you should have basic R programming knowledge and a basic understanding of Exasol and of what UDFs are. The tutorial does not require a deep understanding of data science or machine learning methods.
Model testing with UDFs
A machine learning model is typically designed in multiple steps: load the data -> normalize measures -> build the model -> train the model -> refine parameters -> evaluate performance. In this tutorial we have already developed an adequate model and trained it accordingly. We will focus on the final two steps, which is running the model on the test data and evaluating its performance.
For an introduction to Exasol, see Get Started.
For more information about user defined functions (UDFs), see User Defined Functions (UDFs).
Procedure
You can choose to write the R UDFs in your SQL client, or you can create them in an R development environment using the r-exasol package. The advantage of creating the UDFs in an R environment is that you can play around with dataframes in R and visualize the results in the IDE. Both methods are shown in the tutorial, and the result will be the same regardless of which method you choose.
About the data set
Imagine that you are planning a conference in Phoenix, AZ with a large number of attendees from all over the US. In this scenario, you may want to advise the attendees on which weekday will be the best for them to fly back in order to get a flight with minimum delay.
The data set used in this tutorial contains historical data on domestic flights in the US over a number of years, including information about delays. The public demo database already contains a random forest model trained on a sample from this data set, based on the airline, flight number, and day of the week. In the tutorial you will learn how to use this model to predict the delay for flights on specific weekdays.
R scripts must be compiled with Unix/Linux line endings (LF). If you are using an SQL client on Windows, you must set the linefeed to LF in the client before executing R scripts. To learn more, refer to the documentation for the respective client.
Step 1: Connect to the Exasol public demo system
SQL client
Connect to the system using the SQL client of your choice. Enter the credentials you received when you signed up for the demo.
                                                                         
                                                                    
R environment
In this tutorial we use RStudio with R version 3.5.3, but you can use any environment for R.
Install and configure the Exasol R package
The Exasol R package enables you to interact with an Exasol database directly from your R environment. The package is optimized for reading and writing data from and to a multi-node cluster through the  exa.readData() and exa.writeData() functions. For more information about prerequisites and installation of the Exasol R package, see the documentation in the r-exasol repository on GitHub.
You can install the r-exasol package through devtools. If devtools is not installed, you must first install and load it. 
# install devtools if not already installed
install.packages("devtools")
# install Exasol package using devtools
devtools::install_github("exasol/r-exasol")
# load Exasol library
library(exasol)
Connect to the database
To establish a connection to the database, use the dbConnect function. 
Replace <MY_USERNAME> and <MY_PASSWORD> in the example with the credentials you received when you signed up for the demo.
# create connection    
exaconn <- dbConnect("exa", # EXAdriver object
    exahost = "demodb.exasol.com:8563", # hostname or IP address of Exasol cluster
    uid = "<MY_USERNAME>", # database username
    pwd = "<MY_PASSWORD>", # database password
    encryption = "Y" # enforce encryption
)
To open your schema in the Exasol database in R, use dbGetQuery. 
Replace <MY_USERNAME> in the following example with your database username.
dbGetQuery(exaconn, "OPEN SCHEMA <MY_USERNAME>")Step 2: Examine the data
To understand the data structure in our example, look at the FLIGHTS table in the FLIGHTS schema.
SQL client
-- count the number of rows in the table
select count(*) from FLIGHTS.FLIGHTS;
-- get a preview of the test data
select * from FLIGHTS.FLIGHTS LIMIT 10; R environment
# count the number of rows in the table
dbGetQuery(exaconn, "select count(*) from FLIGHTS.FLIGHTS")
# get a preview of the test data
dbGetQuery(exaconn, "select * from FLIGHTS.FLIGHTS LIMIT 10")Step 3: Create an R UDF for flight delay prediction
When you have established a connection to the database and you understand the structure of the test data, the next step is to create a UDF to predict delays based on the data.
In this tutorial you will create a UDF called FLIGHTS_PRED_DEP_DELAY. The input parameters for this UDF are the columns that should be considered for delay prediction, which are the following:
- AIRLINE_ID (the airline ID)
- FLIGHT_NUM (the flight number)
- DAY_OF_WEEK (the day of week for the flight)
For every input, the UDF will output a line containing the input parameters plus the predicted delay. Instead of creating a SCALAR RETURNS function, we use a SET EMITS function here. This means that the function will not only get a single row as input but multiple rows at once. The GROUP BY statement determines the number of rows each instance of the UDF works on, more details will follow in the next section.
BucketFS
Exasol systems use a distributed file system called BucketFS. A BucketFS service contains buckets that can store files, and each bucket can have different access privileges. 
        In your UDF you will use a model which is stored in a publically readable bucket in the demo system, which means that you do not have to define a password. The model will then be applied to data that is read into a data frame object. 
SQL client
Use the following command to open your schema on the demo system. Replace <YOUR_USER_NAME> with your username.
NOTE: The schema is only set for the current session. If you disconnect from the demo system and then reconnect, you have to execute this statement again.
-- open your own schema 
 OPEN SCHEMA <YOUR_USER_NAME>;CREATE OR REPLACE R SET SCRIPT 
    "FLIGHTS_PRED_DEP_DELAY" ("DAY_OF_WEEK" DECIMAL(18,0), 
    "AIRLINE_ID" DECIMAL(18,0), 
    "FLIGHT_NUM" DECIMAL(18,0), 
    "DEST" CHAR(3) UTF8)
    EMITS (
        "AIRLINE_ID" DECIMAL(18,0), 
        "FLIGHT_NUM" DECIMAL(18,0), 
        "DAY_OF_WEEK" DECIMAL(18,0), 
        "DEST" VARCHAR(200000) UTF8, 
        "PREDICTED_DELAY_IN_MINUTES" DECIMAL(16,14))
    AS
        run <- function(ctx) {
            # load the required package
            require(randomForest)
            # load the flightModel from BucketFS and unserialize it for use
            f <- file("/buckets/bucketfs1/demo_flights/flightModel", open="r", raw=TRUE)
            m <- unserialize(f)
            # work on the data in chunks of 1000 elements
            repeat {
            if(!ctx$next_row(1000))
            break
            # load data into dataframe
            df <- data.frame(
            DAYOFWEEK = ctx$DAY_OF_WEEK,
            AIRLINE_ID = ctx$AIRLINE_ID,
            FLIGHTNUM = ctx$FLIGHT_NUM,
            DEST = ctx$DEST
            )
            # predict the delay and emit (create the result set)
            prediction <- predict(m, newdata = df)
            ctx$emit(df$DAYOFWEEK,df$AIRLINE_ID,df$FLIGHTNUM,df$DEST,prediction)
        }
    }
/                R environment
The function exa.createScript() deploys R code from our R environment into the Exasol database. The function takes the following arguments:
- 
                                                                            The connection object ( exaconn)
- 
                                                                            The name of the UDF ( FLIGHTS_PRED_DEP_DELAY)
- 
                                                                            The R function 
- 
                                                                            The input arguments, with the corresponding data type 
- 
                                                                            The output arguments, with the corresponding data type 
                    The argument replaceIfExists = TRUE is optional but recommended. It ensures that this version of the UDF will replace any former versions that may exist in the database.
PredictInExasolFinal <- exa.createScript(
    exaconn,
    "FLIGHTS_PRED_DEP_DELAY",
    function(ctx) {
    # load the required package
    require(randomForest)
    # load the flightModel from BucketFS and unserialize it for use
    f <- file("/buckets/bucketfs1/demo_flights/flightModel", open="r", raw=TRUE)
    m <- unserialize(f)
    # load the data in chunks and save it in a dataframe
    repeat {
    if(!ctx$next_row(1000))
    break
    # put data into dataframe
    df <- data.frame(
    DAYOFWEEK = ctx$DAY_OF_WEEK,
    AIRLINE_ID = ctx$AIRLINE_ID,
    FLIGHTNUM = ctx$FLIGHT_NUM,
    DEST = ctx$DEST
    )
    prediction <- predict(m, newdata = df)
    ctx$emit(df$DAYOFWEEK,df$AIRLINE_ID,df$FLIGHTNUM,df$DEST,prediction)
    }
    },
    # input arguments
    inArgs = c(
    "DAY_OF_WEEK DECIMAL(18,0)",
    "AIRLINE_ID DECIMAL(18,0)",
    "FLIGHT_NUM DECIMAL(18,0)",
    "DEST CHAR(3)"
    ),
    # output arguments
    outArgs = c(
    "DAY_OF_WEEK DECIMAL(18,0)",
    "AIRLINE_ID DECIMAL(18,0)",
    "FLIGHT_NUM DECIMAL(18,0)",
    "DEST VARCHAR(200000)",
    "PREDICT DOUBLE"
    ),
    # replace existing version (optional)
    replaceIfExists = TRUE
)Step 4: Let the model predict
The next step is to create the view PRED_DELAY, which will enable access to the prediction data. The UDF is executed every time the PRED_DELAY view is queried. This means that if new rows are added the the FLIGHTS table, the view will work on the updated data when it is queried.
You want to run our UDF for each combination of airline id, flight number, day of the week, and destination. Additionally, you are only interested in flights departing from PHX. Therefore, you include a subselect to execute the UDF on the appropriate data set.
To execute the UDF in parallel, you will use a GROUP BY statement together with the IPROC() function, which returns a different number for each node in the cluster. The GROUP BY statement will start one process for each group. This means that GROUP BY IPROC() will start as many instances of the UDF as we have nodes.
SQL client
Create a view that contains the subset of data that should be used by the UDF:
CREATE OR REPLACE VIEW FLIGHTS_PRED_DELAY_PHX AS (
    WITH FLIGHTS_PRE_SELECTION AS (
        SELECT AIRLINE_ID,FLIGHT_NUM,DAY_OF_WEEK,DEST
        FROM FLIGHTS.FLIGHTS
        WHERE ORIGIN = 'PHX'
        GROUP BY AIRLINE_ID,FLIGHT_NUM,DAY_OF_WEEK,DEST
    )
    SELECT "FLIGHTS_PRED_DEP_DELAY" (DAY_OF_WEEK, AIRLINE_ID,FLIGHT_NUM,DEST)
    FROM FLIGHTS_PRE_SELECTION
    GROUP BY IPROC()
    );When you query this view, the UDF will run and you should get the following result:
If you want to persist your results, use the following statement:
CREATE TABLE FLIGHT_PREDICTION_PERSISTED AS (
    SELECT * FROM FLIGHTS_PRED_DELAY_PHX
    );R environment
Create a table that contains the subset of data that should be used by the UDF:
dbGetQuery(exaconn, "CREATE OR REPLACE TABLE FLIGHTS_PRE_SELECTION as select DAY_OF_WEEK,AIRLINE_ID,FLIGHT_NUM,DEST from FLIGHTS.FLIGHTS where ORIGIN = 'PHX' group by DAY_OF_WEEK,AIRLINE_ID,FLIGHT_NUM,DEST")Execute the UDF with the table that you just created:
prediction_output = PredictInExasolFinal(
"DAY_OF_WEEK",
"AIRLINE_ID",
"FLIGHT_NUM",
"DEST",
table="FLIGHTS_PRE_SELECTION",
groupBy = "iproc()"
)You should get the following result:
Step 5: Visualize the data
You can now visualize the data that you predicted with the UDF by connecting your favorite data visualization tool to Exasol. You can also create a visualization directly in R.
R environment visualization
The following example draws a bar plot on mean of prediction values grouped by day of week.
install.packages("data.table")
    library(data.table)
    DT <- data.table(prediction_output)
    DT <- DT[, mean(PREDICT), by = DAY_OF_WEEK]
    DT
    install.packages("ggplot2")
    library(ggplot2)
    # Basic barplot
    p<-ggplot(data=DT, aes(x=DAY_OF_WEEK, y=V1)) +
    geom_bar(stat="identity", fill="steelblue")
    print(p)
                                                                     
                                                                

