R Regression Example

The intended audience for the tutorial is assumed to have a basic understanding of Exasol, in particular, user-defined functions (UDFs) and basic R programming knowledge. The demonstration does not require a deep understanding of data science or machine learning methods. It serves the purpose of trying out user-defined functions (UDFs) on an available machine learning model with a test dataset.

This tutorial is designed to work with the Free Trial System. Ensure you have access to the public demo system. Else sign up here to get access to the demo system hosted by Exasol for testing the data and scripts in this tutorial.

Model Testing with UDFs

In a typical scenario, developing a machine learning model consists of multiple steps. First, you load the data, normalize the measures if necessary, build a model, train it, refine the parameters, and evaluate the performance. To simplify the process for this demo, we assume that we have already developed an adequate model that predicts flight delays and trained it accordingly. We will focus on the last steps, running the model on the test data.

Using R to predict the flight delay, we have two different possibilities. We can either write the R UDFs in a SQL Client or create them directly out of an R development environment with the power of the r-exasol package. Both approaches will give us the same results, the advantage of the latter one is that it allows you to play around with dataframes inside R and consequently lets you visualize the results directly in your R environment. In the below sections, both these approaches are presented side by side to give you a better understanding and to help you choose the preferred approach.

The Dataset: Predicting Flight Delays

Imagine, you are planning a conference in Phoenix with a lot of attendees from all over the US. Now you want to advise them on which weekday will be the best for them to fly back in order to get a flight with a minimum delay.

The dataset we will be working with contains details about domestic flights in the US from several years. A sample of this data was also used to train a Random Forest model in R to predict the flight delays based on the airline, flight number, and the day of the week. In the following tutorial, you will learn how to use this model to predict the delay for flights on specific weekdays. You will also see that there is no need to export data from Exasol to a different system to apply machine learning models. Everything can be done using UDFs directly within Exasol where the data is stored.

We will look at historical data on domestic flights in the US and investigate their delays (>12 million rows). Based on a sample of that data, we have an already trained Random Forest model in R, deployed into the database. Now we can use this model to predict the delay for specific flights.

Procedure

For this demo, we will use a dataset on the publicly available Exasol demo system. If you haven't already, sign up for an account on this page in the Get a test account for ExaCloud section. The schema we will be working with is called "FLIGHTS. It contains multiple tables filled with training and test data.

R scripts need to be complied with LF as the linefeed. If you are using DbVisualizer or DBeaver as your client, then you need to set the linefeed to LF before executing R scripts. To know more, refer to the DbVisualizer and DBeaver topics.

Step 1: Connect to the Exasol Demo Database

As a first step, we will create a connection to the system and get familiar with the table structures. The following table provides you with the details on connecting to an SQL client and R environment.

SQL Client

To connect to the system, you can use the SQL Client of your choice. Exasol recommends DbVisualizer in the Pro version, an open source alternative is the tool DBeaver. Enter the credentials you received by email.

R Environment

For this demo, we use R studio. We have worked with R in version 3.5.3.

Prerequisites

You need to install the prerequisites for r-exasol specified in the r-exasol package.

Exasol R Setup

You can connect to, query from, and write into an Exasol database right from your R environment. The Exasol R package is optimized for reading and writing data from and to a multi-node cluster via exa.readData() and exa.writeData() functions. You can find the official documentation on the GitHub repository.

You can install R Exasol package through the devtools package. However, to do so, make sure you have installed and loaded the devtools package beforehand.

# If needed, install devtools.
install.packages("devtools")
 
# Install package using devtools.
devtools::install_github("EXASOL/r-exasol")
 
# If needed, install and load these libraries.
library(exasol)         # Exasol package
 
install.packages('RODBC')
library(RODBC)          # ODBC database connectivity

Connect to the Database

You can establish a connection to the database through the dbConnect function.

Enter the credentials you received by email before executing the below statement.

exaconn <- dbConnect(
drv = "exa", # EXAdriver object
exahost = "demodb.exasol.com:8563", # IP of database cluster
uid = "<YOUR_USER>", # Username
pwd = "<YOUR_PASSWORD>") # Password

To open a schema in the Exasol database in R, we can use the RODBC package which implements the ODBC database connectivity in R. Change the schema name to your username.

odbcQuery(exaconn, "OPEN SCHEMA <YOUR_USERNAME>");

 

 

Step 2: Examine the Data

The first thing we will do is to have a look at the data we are working with. The relevant table is called OTP and stored in the FLIGHTS schema.

SQL Client

-- Examine the data:
-- How many rows does the table hold?
select count(*) from FLIGHTS.FLIGHTS;
  
-- What does the data look like?
select * from FLIGHTS.FLIGHTS LIMIT 10; -- Get a preview of the test data.

 

R Environment

# Examine the data:
# How many rows does the table hold?
dbGetQuery(exaconn, "select count(*) from FLIGHTS.FLIGHTS")
# What does the data look like?
dbGetQuery(exaconn, "select * from FLIGHTS.FLIGHTS LIMIT 10")

 

Step 3: Creating an R UDF for Flight Delay Prediction

Now that we have seen the data let us create a UDF called FLIGHTS_PRED_DEP_DELAY to predict delays based on our data. Subsequently, the input parameters for the UDF are all the columns that should be considered for the delay prediction, which are:

  • Airline ID
  • Flight number
  • Day of week

For every input, the UDF will output a line containing the input parameters plus the predicted delay. Instead of creating a SCALAR RETURNS function, we use a SET EMITS function here. It means that the function will not only get a single row as input but multiple rows at once. The GROUP BY statement determines the number of rows each instance of the UDF works on, more details will follow in the next section.

Inside our UDF, we will use the model stored in BucketFS. As this is public readable, we can access it without defining a password. The model will then be applied to the data read into a data frame object.

SQL Client

-- Open your own schema. You will create your scripts in this schema later on.
open schema <YOUR_USER_NAME>;
 
--/
CREATE OR REPLACE R SET SCRIPT "FLIGHTS_PRED_DEP_DELAY" ("DAY_OF_WEEK" DECIMAL(18,0), "AIRLINE_ID" DECIMAL(18,0), "FLIGHT_NUM" DECIMAL(18,0), "DEST" CHAR(3) UTF8)
EMITS ("AIRLINE_ID" DECIMAL(18,0), "FLIGHT_NUM" DECIMAL(18,0), "DAY_OF_WEEK" DECIMAL(18,0), "DEST" VARCHAR(200000) UTF8, "PREDICTED_DELAY_IN_MINUTES" DECIMAL(16,14))
AS
  
    run <- function(ctx) {
     
        #Load the required package.
        require(randomForest)
         
        # Load the flightModel from showcase BucketFS and unserialise it for use.
        f <- file("/buckets/bucketfs1/demo_flights/flightModel", open="r", raw=TRUE)
        m <- unserialize(f)
         
         
        # Work on the data in chunks of 1000 elements.
        repeat {
          if(!ctx$next_row(1000))
            break
          # Load the data in dataframe.
          df <- data.frame(
             
            DAYOFWEEK = ctx$DAY_OF_WEEK,
            AIRLINE_ID = ctx$AIRLINE_ID,
            FLIGHTNUM = ctx$FLIGHT_NUM,
            DEST = ctx$DEST
             
          )
          # Predict the delay & emit (= create the result set)
          prediction <- predict(m, newdata = df)
          ctx$emit(df$AIRLINE_ID,df$FLIGHTNUM,df$DAYOFWEEK,df$DEST,prediction)
        }
}
/

R Environment

The function exa.createScript() can automatically deploy R code from our R environment into the Exasol database.

exa.createScript() gets the following arguments:

  • The connection object we created
  • The UDF's name
  • The actual function
  • The input arguments with the corresponding datatype
  • The output arguments with the corresponding datatype
  • The last argument replaceIfExists is optional but quite handy because this way you make sure that your version will replace former versions of the same UDF in the database
PredictInExasolFinal <- exa.createScript(
  exaconn,
  "FLIGHTS_PRED_DEP_DELAY_R_STUDIO",
  function(ctx) {
     
    #Load the required package.
    require(randomForest)
 
     
    # Load the flightModel from showcase BucketFS and unserialise it for use.
    f <- file("/buckets/bucketfs1/demo_flights/flightModel", open="r", raw=TRUE)
    m <- unserialize(f)
     
     
    #Load the data in chunks and save it in a dataframe.
    repeat {
      if(!ctx$next_row(1000))
        break
      # Put data in dataframe.
      df <- data.frame(
         
        DAYOFWEEK = ctx$DAY_OF_WEEK,
        AIRLINE_ID = ctx$AIRLINE_ID,
        FLIGHTNUM = ctx$FLIGHT_NUM,
        DEST = ctx$DEST
      )
       
      prediction <- predict(m, newdata = df)
       
      ctx$emit(df$AIRLINE_ID,df$FLIGHTNUM,df$DAYOFWEEK,df$DEST,prediction)
    }
     
  },
   
  #Input Arguments
   
  inArgs = c(
    "DAY_OF_WEEK DECIMAL(18,0)",
    "AIRLINE_ID DECIMAL(18,0)",
    "FLIGHT_NUM DECIMAL(18,0)",
    "DEST CHAR(3)"
  ),
   
  #Output Arguments
   
  outArgs = c(
    "DAY_OF_WEEK DECIMAL(18,0)",
    "AIRLINE_ID DECIMAL(18,0)",
    "FLIGHT_NUM DECIMAL(18,0)",
    "DEST VARCHAR(200000)",
    "PREDICT DOUBLE"
    ),
  replaceIfExists = TRUE
)

Step 4: Let the Model Predict

Next, we want to execute our UDF FLIGHTS_PRED_DEP_DELAY and evaluate the model's performance. To facilitate accessing the prediction data, for example, with a front-end tool, we create the view "PRED_DELAY". The prediction UDF is executed every time this view is queried. Therefore if new rows are added to our FLIGHTS table, the next execution of the view would automatically work with the new data.

We want to run our UDF for each combination of airline id, flight number, day of the week, and destination. Additionally, we are only interested in flights starting at Phoenix. Therefore we include a subselect to execute the UDF on the appropriate dataset.

To execute the UDF in parallel, we use a group by statement. We use the iproc() function here which returns a different number for each node in the cluster. Each group of the group by statement will start one process, what this means is that a group by iproc() will start as many instances of the UDF as we have nodes.

SQL Client

CREATE OR REPLACE VIEW FLIGHTS_PRED_DELAY_PHX as (
with FLIGHTS_PRE_SELECTION as
        (
        select  AIRLINE_ID,FLIGHT_NUM,DAY_OF_WEEK,DEST
        from    FLIGHTS.FLIGHTS
        where   ORIGIN = 'PHX'
        group by AIRLINE_ID,FLIGHT_NUM,DAY_OF_WEEK,DEST
        )
select
"FLIGHTS_PRED_DEP_DELAY" (DAY_OF_WEEK, AIRLINE_ID,FLIGHT_NUM,DEST)
from FLIGHTS_PRE_SELECTION
group by iproc()

In the following screenshot, you can see what you will get when querying the view:

If you wish to persist your results, you can do this easily with the following statement:

create table FLIGHT_PREDICTION_PERSISTED as (
SELECT * FROM FLIGHTS_PRED_DELAY_PHX
);

R Environment

We first create a table containing only a subset of data so that we can hand it over to our Prediction UDF afterwards.

odbcQuery(exaconn, "CREATE OR REPLACE TABLE FLIGHTS_PRE_SELECTION as select  DAY_OF_WEEK,AIRLINE_ID,FLIGHT_NUM,DEST from    FLIGHTS.FLIGHTS where   ORIGIN = 'PHX' group by DAY_OF_WEEK,AIRLINE_ID,FLIGHT_NUM,DEST")

Next, we execute the UDF with the table we just created:

odbcQuery(exaconn, "CREATE OR REPLACE TABLE FLIGHTS_PRE_SELECTION as select  DAY_OF_WEEK,AIRLINE_ID,FLIGHT_NUM,DEST from    FLIGHTS.FLIGHTS where   ORIGIN = 'PHX' group by DAY_OF_WEEK,AIRLINE_ID,FLIGHT_NUM,DEST")
prediction_output = PredictInExasolFinal( "DAY_OF_WEEK", "AIRLINE_ID", "FLIGHT_NUM", "DEST", table="FLIGHTS_PRE_SELECTION", groupBy = "iproc()" )

The result is as shown below: 

Step 5: Visualize Data

Now that we have predicted the results, we can visualize them. You can do this either by connecting your favorite data visualization tool to Exasol or directly in R.

R Environment

To illustrate results we draw a bar plot on mean of prediction values grouped of day of week.

install.packages("data.table")
library(data.table)
DT <- data.table(prediction_output)
DT <- DT[, mean(PREDICT), by = DAY_OF_WEEK]
DT
install.packages("ggplot2")
library(ggplot2)
# Basic barplot
p<-ggplot(data=DT, aes(x=DAY_OF_WEEK, y=V1)) +
 geom_bar(stat="identity", fill="steelblue")
print(p)

Now you have seen how easy it is to create R UDFs.