Predict Through UDF in SQL

In this tutorial you will learn how to run the prediction model from the Exasol native environment in a SQL development tool.

This tutorial shows how you can deploy the prediction model using a UDF that dynamically executes R code in a SQL development tool. The procedure is similar to what is described in Train and Predict in Exasol, except that you use the context object ctx to get the data from the database.

Prerequisites

You must have completed the Train Locally and Predict in Exasol tutorial.

Create prediction function

-- 1. Create prediction function
CREATE OR REPLACE R SET SCRIPT r_demo.predict3(...) EMITS (id INT, RealValue DOUBLE, Prediction DOUBLE) AS

# Load library
library(randomForest)

# Get number of columns
numCols <- exa$meta$input_column_count

# Create empty list with length numCols
l = vector("list", numCols)

run <- function(ctx) {

  # Load the model from the bucket
  rf_model = readRDS("/buckets/bucketfs1/bucket1/rf_model")

  # ...or depending on the file type
  #load("/buckets/bucketfs1/bucket1/rf_model.dat")

  # Split input into modest sized chunks, this can be adjusted for memory consumption
  repeat{
    if (!ctx$next_row(1000000))
        break

  # Populate list with data columns
  for (i in 1:numCols){

        l[[i]] <- ctx[[i]]()

  }

  # Convert to data.frame for modeling
  df = as.data.frame(l)

  # Set the column names

  # From the model object, note: model/scenario specific
  # the_colnames <- names(rf_model$forest$xlevels)
  # colnames(df) = the_colnames

  # Assign manually, note: not dynamic
  # Note: case sensitive when using double quotes
  colnames(df) <- c("id", "crim", "zn", "indus", "chas", "nox", "rm",
  "age", "dis", "rad", "tax", "ptratio", "b", "lstat", "medv", "split")

  # Get predictions
  prediction   <- predict(rf_model, newdata = df)

  # Output ids, real values and predicted ones
  ctx$emit(df$id, df$medv, prediction)
  }
}
/

Run prediction function

-- 2. Run prediction function
SELECT r_demo.predict3(id,crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat, medv, split)
FROM   r_demo.boston_housing
WHERE  split = 'Test'
GROUP  BY IPROC()         -- Node number for data locality
        , MOD(ROWNUM, 5) -- Number of processes per node
ORDER  BY id;

The UDFs you created are saved in your schema and can be accessed using an SQL client.