R Package

Exasol provides a package for R that you can use to connect to an Exasol database and then read and write data from/to the database from your R environment. The R package is optimized for reading data using exa.readData() function and exa.writeData() function to write the data into the database on a multi-node cluster.

This section explains how you can install and then use the R package with a sample dataset from Boston Housing.

Prerequisites

To use the R package with Exasol, you need the followings:

Installation

Install Rtools

Install developer extensions for R to build from sources:

  • For Windows: Install Rtools matching your version of R from here.
  • For Linux: Install the R-devel (RPM) or r-base-dev (Debian) package.

Always check the official R-project's download section for your OS to find the latest package.

Install ODBC Driver

Install Exasol ODBC driver by following the steps mentioned in ODBC Driver. After installation is complete, it's recommended to create a DSN pointing to your database instance. For more information about it, see README.txt file of the downloaded ODBC package.

On Linux, you also need the development files for ODBC. Therefore, install unixodbc-devel (RPM) or unixodbc-dev (Debian) package.

Install devtools

You can install the R Package using devtools package. If devtools package is not installed, run the following command:

install.packages("devtools")

Install R Package

Run the following command to install the Exasol R Package:

devtools::install_github("EXASOL/r-exasol")
        # add the Exasol R library
        library(exasol)

Help

You can use the following methods to get help regarding the package:

  • ?exasol: Shows package documentation with examples for each method.
  • ?dbConnect: Shows documentation of individual commands with Exasol-specific parameters.

Connection to Database

You can connect with Exasol database via dbConnect function. In the arguments of the dbConnect function make sure to specify the correct exahost. This is the IP address that connects to the database and might change in case of a virtual machine.

# Connect to Exasol DB with an ODBC DSN
            exaconn <- dbConnect(
            drv     = "exa",                  # EXAdriver object
            exahost = "192.168.56.103:8563",  # IP of database cluster
            uid     = "sys",                  # Username
            pwd     = "exasol")               # Password

            # List all tables in Exasol (returns a character vector).
        dbListTables(exaconn)

The DBI package also has similar functions. If you use the DBI package, you might need to call dbConnect function by using ::: operation in R. For example, exasol:::dbConnect.

Write Data

You can write into Exasol database using the exa.writeData function. This is a low level function and is faster than dbWriteTable function provided by DBI.

Step 1: Load Library in R

Before exploring the arguments of exa.writeData, load some data in R from the mlbench library that contains the BostonHousing dataset.

#Add the mlbench library containing the Boston Housing dataset. This is used as an example dataset 
                library(mlbench)
                # Load some data into R
                data("BostonHousing")

Step 2: Create Schema

Exasol stores data in tables that are organized in schemas, therefore, you need to create a schema for the table you are going to work with.

To create a schema in the Exasol database in R, you can use the RODBC package which implements ODBC database connectivity in R.

# Create an empty table in r_demo with the name `boston_housing` and specify
                # column names and types as per Exasol standards.
                odbcQuery(exaconn,
                "CREATE OR REPLACE TABLE boston_housing(
                crim     DOUBLE,
                zn       DOUBLE,
                indus    DOUBLE,
                chas     VARCHAR(10),
                nox      DOUBLE,
                rm       DOUBLE,
                age      DOUBLE,
                dis      DOUBLE,
                rad      DOUBLE,
                tax      DOUBLE,
                ptratio  DOUBLE,
                b        DOUBLE,
                lstat    DOUBLE,
                medv     DOUBLE
                )")

                # Check again the list of tables contained in the Exasol database
                dbListTables(exaconn)

Step 3: Write Data

The database now contains a new schema and table with its column names. To populate the columns with the actual data, exa.writeData is used. Specify the connection, source data set name, and the table name in the arguments of the function.

# Finally write the data into Exasol
 exa.writeData(exaconn, data = BostonHousing, tableName = "boston_housing")

Step 4: Verify the Data (optional)

Check the newly created table in an SQL client of your choice.

SELECT * FROM r_demo.boston_housing;

Read Data

You can read from Exasol using the exa.readTable function (it's faster than dbReadTable function). In the arguments, specify the connection and the query to be executed. This pulls the query's result over from the database into a data.frame.

# Read table from the Exasol database into the data.frame object called boston_housing_exasol
 boston_housing_exasol = exa.readData(exaconn, "SELECT * FROM boston_housing")
        

You can view the same data that you transferred to the Exasol database.

#Plotting library
    library(ggplot2)

    bhplot = ggplot(boston_housing_exasol, aes(RM, MEDV)) +
    geom_point(colour = "black", shape = 1) +
    ggtitle("House price vs. # of rooms per house")

    plot(bhplot)

A plot of the MEDV (median house value) vs RM (rooms per house) is shown below.

Close Connection

When done, you can clean up and close the database connection.

# Optional: Drop schema and the table it contains
    # odbcQuery(exaconn, "DROP SCHEMA r_demo CASCADE")# Disconnect
    dbDisconnect(exaconn)

Train and Predict through UDFs

If you want to know how to use R and Exasol, both as standalone tools and combined, to run a machine learning algorithm such as Random Forests (RF) on some data, see Train and Predict with R through UDFs.