Predict Through UDF in SQL

This section explains how to run the prediction model from the Exasol native environment in the SQL development tool.

The model can be deployed using a UDF that dynamically executes R code. This is similar to what is described in Train and Predict in Exasol, except for the way you get the data from the database via the context object ctx.

For more details on R UDFs, see R.

-- 1. Create prediction function
CREATE OR REPLACE R SET SCRIPT r_demo.predict3(...) EMITS (id INT, RealValue DOUBLE, Prediction DOUBLE) AS

# Load library
library(randomForest)

# Get number of columns
numCols <- exa$meta$input_column_count

# Create empty list with length numCols
l = vector("list", numCols)

run <- function(ctx) {

  # Load the model from the bucket
  rf_model = readRDS("/buckets/bucketfs1/bucket1/rf_model")

  # ...or depending on the file type
  #load("/buckets/bucketfs1/bucket1/rf_model.dat")

  # Split input into modest sized chunks, this can be adjusted for memory consumption
  repeat{
    if (!ctx$next_row(1000000))
        break

  # Populate list with data columns
  for (i in 1:numCols){

        l[[i]] <- ctx[[i]]()

  }

  # Convert to data.frame for modeling
  df = as.data.frame(l)

  # Set the column names

  # From the model object, note: model/scenario specific
  # the_colnames <- names(rf_model$forest$xlevels)
  # colnames(df) = the_colnames

  # Assign manually, note: not dynamic
  # Note: case sensitive when using double quotes
  colnames(df) <- c("id", "crim", "zn", "indus", "chas", "nox", "rm",
  "age", "dis", "rad", "tax", "ptratio", "b", "lstat", "medv", "split")

  # Get predictions
  prediction   <- predict(rf_model, newdata = df)

  # Output ids, real values and predicted ones
  ctx$emit(df$id, df$medv, prediction)
  }
}
/

-- 2. Run prediction function
SELECT r_demo.predict3(id,crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat, medv, split)
FROM   r_demo.boston_housing
WHERE  split = 'Test'
GROUP  BY IPROC()         -- Node number for data locality
        , MOD(ROWNUM, 5) -- Number of processes per node
ORDER  BY id;

The UDFs you created are saved in your schema and can be accessed using an SQL client.