Train Locally and Predict in Exasol
In this tutorial you will learn how to train a model locally in R and then use it for prediction in Exasol through a UDF.
Prerequisites
- A running Exasol database
- An R development environment such as RStudio
Load data and train in R
Load libraries in R
This tutorial uses data from the BostonHousing sample data set in the mlbench library. Also load the libraries RODBC, exasol, randomForest, and RCurl, which will be required in this tutorial.
# load libraries
library(mlbench) # Load data
library(RODBC) # ODBC database connectivity
library(exasol) # R interface for Exasol database
library(randomForest) # Random forests for regression and classification
library(RCurl) # HTTP communication
Load data and split into training and testing sets
Load the data into R, then split it into a training set and a testing set. The ratio of the split is arbitrary and depends on the available data and application, but in this case you will keep 75% of the data for training and use the remaining 25% for testing purposes.
There are several methods that you can use to split the data. This tutorial randomly generates indices and uses them to subset rows from the main data set.
# Load data
data("BostonHousing")
# Generate random indices (75% of the sample size)
set.seed(42)
indices <- sample(nrow(BostonHousing), size = floor(nrow(BostonHousing) * 0.75))
# Add a dummy variable with 2 levels "Train" and "Test"
BostonHousing$split = factor(NA, levels = c("Train", "Test"))
# Assign predefined rows respectively to "Train" and "Test" subsets
BostonHousing$split[indices] = "Train"
BostonHousing$split[-indices] = "Test"
# Add an id variable
BostonHousing$id = seq_len(nrow(BostonHousing))
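As a quick sanity check, you can confirm that the split produced the expected proportions:
# Check how many rows went into each subset
# (506 rows in total: 379 Train, 127 Test)
table(BostonHousing$split)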
Train the model
You can now train a basic random forest model on the training data and see which variables contribute most to predicting the median house value.
set.seed(42)
# Run model (note that we filter only for the training data (split=="Train"))
rf_model = randomForest(medv ~ .-split -id,
data = BostonHousing[BostonHousing$split=="Train",],
importance = TRUE)
# See which variables are important in predicting house prices
varImpPlot(rf_model)
# rm - number of rooms per house
# lstat - % lower status of the population
Upload the model to BucketFS
UDF scripts are executed in parallel on the Exasol cluster. The UDF that you will use for prediction must have access to the model that you just trained. You can store the model on any file service, but for performance reasons we recommend that you upload it to BucketFS on the cluster nodes.
BucketFS is a synchronous file system available on all database nodes in an Exasol cluster. You can upload files to BucketFS in R using packages that allow HTTP communication, such as httr or RCurl. In this tutorial you will use the RCurl package to upload the trained model.
The model that lives in the R environment has to be serialized (converted into a stream of raw bytes) when it is uploaded to BucketFS. When it is retrieved from BucketFS, it must be unserialized to restore the original R object. To learn more about serialization in R, enter ?serialize in the R console.
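As a quick illustration of the round trip (not part of the tutorial's workflow itself), serializing an object yields a raw vector of bytes, and unserializing that vector restores an identical object:
# Serialize an R object into a raw vector of bytes
raw_bytes <- serialize(BostonHousing, connection = NULL)
class(raw_bytes)   # "raw"
# Unserializing restores the original object
identical(BostonHousing, unserialize(raw_bytes))   # TRUE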
For more information about BucketFS, see BucketFS.
Values in angle brackets such as <bucket_name> are placeholders. Replace these with your own values.
# Define options for the authentication
curl_opts = curlOptions(userpwd = "w:<write_passwd>",
verbose = FALSE,
httpauth = AUTH_BASIC)
# Transfer model to the bucket
httpPUT(
# EXABucket URL (change for your environment)
url = "http://<Exasol server>:<BucketFS service port>/<Bucket name>/rf_model",
# It is important to serialize the model
content = serialize(rf_model, ascii = FALSE, connection = NULL),
# EXABucket: authenticate
curl = getCurlHandle(.opts = curl_opts)
)
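To verify the upload, you can list the bucket's contents: an HTTP GET on the bucket URL returns the names of the files it contains. A minimal sketch, assuming your bucket's read password is <read_passwd>:
# List the files in the bucket (one name per line); "r" is the read user
cat(getURL("http://<Exasol server>:<BucketFS service port>/<Bucket name>",
userpwd = "r:<read_passwd>",
httpauth = AUTH_BASIC))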
Load data into Exasol
Load the data into the Exasol database by opening a connection and using the exa.writeData function from the exasol R package. To learn more about how to write data into an Exasol database from R, see Write data.
# Create connection with the Exasol database
exaconn <- dbConnect(
drv = "exa", # EXAdriver object
exahost = "192.168.56.103:8563", # IP of database cluster
uid = "sys", # Username
pwd = "exasol") # Password
# Create database schema (if it does not yet exist) with the name r_demo
# (This also opens it, i.e. makes it the default container for all subsequent steps below)
odbcQuery(exaconn, "CREATE SCHEMA IF NOT EXISTS r_demo")
# If the schema already existed, we need to open it to make it the default container for all subsequent steps below
odbcQuery(exaconn, "OPEN SCHEMA r_demo")
# Create an empty table in Exasol with the name boston_housing
odbcQuery(exaconn,
"CREATE OR REPLACE TABLE boston_housing(
crim DOUBLE,
zn DOUBLE,
indus DOUBLE,
chas VARCHAR(10),
nox DOUBLE,
rm DOUBLE,
age DOUBLE,
dis DOUBLE,
rad DOUBLE,
tax DOUBLE,
ptratio DOUBLE,
b DOUBLE,
lstat DOUBLE,
medv DOUBLE,
split VARCHAR(10),
id INT
)")
# Write the data (both the Train and Test subsets) into Exasol
exa.writeData(exaconn, data = BostonHousing, tableName = "boston_housing")
To verify that the data has been transferred to the corresponding schema, you can use an SQL client to connect to the database.
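Alternatively, you can run a quick check from within R; a minimal sketch using the exa.readData function from the exasol package:
# Count the rows that arrived in the table (BostonHousing has 506 rows)
exa.readData(exaconn, "SELECT COUNT(*) FROM r_demo.boston_housing")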
Predict in Exasol
To run prediction on the data in the r_demo schema, use the exa.createScript function to create a UDF that applies the model stored in BucketFS.
# Create UDF for prediction
PredictInExasol1 <- exa.createScript(
exaconn,
"r_demo.dt_predict1",
function(data) {
# Load the required package for prediction
require(randomForest)
# Load the model from the local BucketFS mount (note that it is unserialized)
# Change the bucket information for your environment
rf_model <- unserialize(file("/buckets/<BucketFS service name>/<Bucket name>/rf_model", "rb"))
# Process the input in chunks of 1000 rows at a time
repeat {
if (!data$next_row(1000))
break
# Put the current chunk into a data.frame
df <- data.frame(
id = data$id,
crim = data$crim,
zn = data$zn,
indus = data$indus,
chas = data$chas,
nox = data$nox,
rm = data$rm,
age = data$age,
dis = data$dis,
rad = data$rad,
tax = data$tax,
ptratio = data$ptratio,
b = data$b,
lstat = data$lstat,
medv = data$medv,
split = data$split
)
# Use the loaded model to predict on the current chunk
prediction <- predict(rf_model, newdata = df)
# Emit the id, the real value, and the prediction for each row
data$emit(df$id, df$medv, prediction)
}
},
# Input arguments
inArgs = c("id INT",
"crim DOUBLE",
"zn DOUBLE",
"indus DOUBLE",
"chas VARCHAR(10)",
"nox DOUBLE",
"rm DOUBLE",
"age DOUBLE",
"dis DOUBLE",
"rad DOUBLE",
"tax DOUBLE",
"ptratio DOUBLE",
"b DOUBLE",
"lstat DOUBLE",
"medv DOUBLE",
"split VARCHAR(10)"),
# Output arguments
outArgs = c("id INT",
"RealValue DOUBLE",
"Prediction DOUBLE")
)
# Create a table with the real values and predicted ones
# Note that the prediction is done on the test data using the 'where' argument
prediction_output = PredictInExasol1("id","crim", "zn", "indus", "chas", "nox", "rm", "age", "dis",
"rad", "tax", "ptratio", "b", "lstat", "medv", "split",
table = "r_demo.boston_housing",
groupBy = "iproc(),mod(iproc(),5)",
where = "split = 'Test'")
# Check the root mean squared error (RMSE)
RMSE = sqrt(mean((prediction_output$PREDICTION - prediction_output$REALVALUE)^2))
# Should be 3.35
You can also plot your predicted values and see the output:
# Plot the predictions against real values
library(ggplot2)
plot_predictions = ggplot(prediction_output, aes(ID)) +
geom_line(aes(y = PREDICTION, colour = "Prediction")) +
geom_line(aes(y = REALVALUE, colour = "True Value")) +
xlab("") +
ylab("Results") +
ggtitle("Predicted vs Real values") +
theme(legend.title = element_blank())
plot(plot_predictions)