Predict Through UDF in SQL
In this tutorial, you will learn how to run the prediction model in the Exasol native environment from a SQL development tool.
This tutorial shows how to deploy the prediction model using a UDF that dynamically executes R code. The procedure is similar to the one described in Train and Predict in Exasol, except that you use the context object ctx to get the data from the database.
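If you have not worked with the context object before, the sketch below shows the basic pattern (the script and column names are placeholders, not part of this tutorial): ctx$next_row() loads the next chunk of input rows, the input columns are read from the context, and ctx$emit() sends rows back to SQL.
-- Minimal sketch of the ctx pattern (placeholder names)
CREATE OR REPLACE R SET SCRIPT r_demo.ctx_sketch(val DOUBLE) EMITS (val DOUBLE) AS
run <- function(ctx) {
    repeat {
        if (!ctx$next_row(1000))  # read up to 1000 rows into the context
            break
        ctx$emit(ctx[[1]]())      # first input column, emitted unchanged
    }
}
/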
Prerequisites
You must have completed the Train Locally and Predict in Exasol tutorial.
Create prediction function
-- 1. Create prediction function
CREATE OR REPLACE R SET SCRIPT r_demo.predict3(...) EMITS (id INT, RealValue DOUBLE, Prediction DOUBLE) AS
# Load the library
library(randomForest)

# Get the number of input columns
numCols <- exa$meta$input_column_count

# Pre-allocate a list with one element per input column
l <- vector("list", numCols)

run <- function(ctx) {
    # Load the model from the bucket
    rf_model <- readRDS("/buckets/bucketfs1/bucket1/rf_model")
    # ...or, depending on the file type:
    # load("/buckets/bucketfs1/bucket1/rf_model.dat")

    # Read the input in modest-sized chunks; adjust the chunk size to control memory consumption
    repeat {
        if (!ctx$next_row(1000000))
            break

        # Populate the list with the data columns
        for (i in 1:numCols) {
            l[[i]] <- ctx[[i]]()
        }

        # Convert to a data.frame for modeling
        df <- as.data.frame(l)

        # Set the column names, either from the model object (model/scenario specific):
        # the_colnames <- names(rf_model$forest$xlevels)
        # colnames(df) <- the_colnames
        # ...or manually (not dynamic).
        # Note: column names are case sensitive when quoted with double quotes
        colnames(df) <- c("id", "crim", "zn", "indus", "chas", "nox", "rm",
                          "age", "dis", "rad", "tax", "ptratio", "b", "lstat", "medv", "split")

        # Get the predictions
        prediction <- predict(rf_model, newdata = df)

        # Emit the ids, the real values, and the predictions
        ctx$emit(df$id, df$medv, prediction)
    }
}
/
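If readRDS cannot find the model, it helps to check what the UDF can actually see in the bucket. The helper below is only a sketch (the script name is arbitrary and not part of the original tutorial); it relies on the standard R function list.files and the same /buckets path used above.
-- Optional helper (sketch): list the files visible under a bucket path
CREATE OR REPLACE R SCALAR SCRIPT r_demo.list_bucket_files(path VARCHAR(2000)) EMITS (file_name VARCHAR(2000)) AS
run <- function(ctx) {
    for (f in list.files(ctx$path)) {
        ctx$emit(f)
    }
}
/
SELECT r_demo.list_bucket_files('/buckets/bucketfs1/bucket1') FROM dual;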
Run prediction function
-- 2. Run prediction function
SELECT r_demo.predict3(id, crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat, medv, split)
FROM r_demo.boston_housing
WHERE split = 'Test'
GROUP BY IPROC()        -- Node number, for data locality
       , MOD(ROWNUM, 5) -- Number of parallel processes per node
ORDER BY id;
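If you want to keep the predictions for later analysis, you can wrap the same query in a CREATE TABLE statement; the table name below is only an example.
-- Optional: persist the predictions (example table name)
CREATE OR REPLACE TABLE r_demo.boston_predictions AS
SELECT r_demo.predict3(id, crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat, medv, split)
FROM r_demo.boston_housing
WHERE split = 'Test'
GROUP BY IPROC()
       , MOD(ROWNUM, 5);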
The UDFs you created are saved in your schema and can be accessed using an SQL client.
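For example, you can list the scripts stored in the schema by querying the EXA_ALL_SCRIPTS system view (the selected columns are an assumption and may vary between Exasol versions):
-- List the UDF scripts stored in the R_DEMO schema
SELECT script_schema, script_name, script_language
FROM exa_all_scripts
WHERE script_schema = 'R_DEMO';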