Apache Airflow

You can easily integrate Exasol into Apache Airflow, a platform to programmatically author, schedule, and monitor workflows. This section provides information on how to connect the Exasol database to Apache Airflow.

Prerequisite

  • Exasol JDBC driver installed on the Apache Airflow server. You can download the driver from the Exasol downloads section.

Installation and Configuration of Airflow

In this document, we will provide you with steps to set up an instance of Airflow running on CentOS 7.

  1. Install CentOS 7 on a local or virtual machine and select Server GUI as a base. Perform an update, for example:
    sudo yum -y update
    uname -a
    Linux centos7.exasol.local 3.10.0-957.12.2.el7.x86_64 #1 SMP Tue May 14 21:24:32 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  2. Next, install the EPEL (Extra Packages for Enterprise Linux) repository on CentOS, an additional package repository that provides install packages for commonly used software. To install the EPEL repository, run the following command:
    sudo yum install epel-release
  3. Install pip, a tool for installing and managing Python packages:
    sudo yum -y install python-pip
    sudo pip install --upgrade pip
  4. Install Apache Airflow together with the common extras (such as MySQL, Celery, Crypto, or Password auth):
    sudo yum -y install gcc gcc-c++ libffi-devel mariadb-devel python-devel
    sudo pip install psutil
    sudo pip install --upgrade setuptools
    sudo pip install 'apache-airflow[async,celery,crypto,jdbc,mysql,password,rabbitmq]'
  5. Create a new directory for Airflow, set it as the Airflow home, and prepare it for the installation:
    mkdir -p /usr/opt/airflow
    groupadd -g 500 airflow
    useradd -g 500 -u 500 -M -d /usr/opt/airflow/ airflow
    chown -R airflow:airflow /usr/opt/airflow/

    In this example, the Airflow home directory is /usr/opt/airflow. Additionally, we have created a group called airflow and made it the owner of the directory with the relevant permissions. This step is optional.

  6. Install Airflow. For more information, see Apache Airflow Installation.
  7. Next, initialize the database. This creates the necessary configuration files in the Airflow home directory:
    export AIRFLOW_HOME=/usr/opt/airflow
    airflow initdb
  8. Start the Airflow web server as a daemon on the default port 8080:
    airflow webserver -D -p 8080
  9. Start the Airflow scheduler as a daemon:
    airflow scheduler -D
  10. Once the Airflow web server and scheduler are running successfully, you can access the Airflow Admin UI. Open a browser and enter the following URL:
    http://localhost:8080/admin/

    The first time you run Airflow, it creates a file called airflow.cfg in your $AIRFLOW_HOME directory (~/airflow by default). This file contains Airflow's configuration and you can edit it to change any of the settings. You can also access this file from the UI by navigating to the Admin > Configuration menu.

  11. Next, create a connection to connect Airflow to external systems. To create a new connection using the UI, navigate to the Admin console in the browser and select Connection > Create.
  12. Enter the following details (an example with filled-in values follows this list):

    • Conn Id: The ID of the connection for reference within the Airflow DAGs.
    • Conn Type: The type of connection. In this case, it is JDBC Connection.
    • Connection URL: The connection URL. The JDBC driver uses this URL structure: jdbc:exa:<host>:<port>[;<prop_1>=<value_1>]...[;<prop_n>=<value_n>].
    • Driver Path: The location where the driver is installed on the Airflow server.
    • Driver Class: The main class of the Exasol driver, for example com.exasol.jdbc.EXADriver.
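
    For example, a connection used by the DAG later in this section could be filled in as follows. The host, port, schema, and driver path are placeholder values; replace them with the details of your environment:

    Conn Id: Exasol_db
    Conn Type: JDBC Connection
    Connection URL: jdbc:exa:192.168.1.100:8563;schema=SYS
    Driver Path: /usr/opt/airflow/drivers/exajdbc.jar
    Driver Class: com.exasol.jdbc.EXADriver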

Test the Connection

You can test the connection by running an Ad Hoc query. The Ad Hoc query enables simple SQL interactions with the database connections registered in Airflow.

In the Admin console, navigate to Data Profiling > Ad Hoc Query. Select the Exasol Connection you created and execute any SQL query to test the connection.
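
Alternatively, you can run a quick check from Python on the Airflow server using the JdbcHook, which the JdbcOperator uses internally. The following is a minimal sketch; it assumes the connection created above was saved with the Conn Id Exasol_db:

from airflow.hooks.jdbc_hook import JdbcHook

# Use the connection registered in Airflow under the Conn Id 'Exasol_db'
hook = JdbcHook(jdbc_conn_id='Exasol_db')

# Run a trivial query; a successful result confirms the JDBC setup
rows = hook.get_records('select current_timestamp')
print(rows)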

Create DAGs

A DAG (Directed Acyclic Graph) is a collection of all the tasks you want to run, organized in a way that reflects their dependencies. A DAG is defined in a Python script, which represents the DAG's structure (tasks and their dependencies) as code.

You can create a DAG by defining the script and adding it to a folder, for example "dags", within the $AIRFLOW_HOME directory. In our case, the directory to which we need to add DAGs is /usr/opt/airflow/dags.

Example:

The following is an example DAG that connects to an Exasol database and runs simple SELECT and INSERT statements:

from airflow import DAG
from airflow.operators.jdbc_operator import JdbcOperator
from datetime import datetime, timedelta

# The following are defaults which can be overridden later on
default_args = {
    'owner': 'exasol_test',
    'depends_on_past': False,
    'start_date': datetime(2019, 5, 31),
    'email': ['exasol_test@exasol.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

# template_searchpath tells Airflow where to look for external .sql files
dag = DAG('Exasol_DB_Checks',
          default_args=default_args,
          template_searchpath='/usr/opt/airflow/templates')

# sql_task1 and sql_task2 are examples of tasks created using operators
# sql_task1 runs SQL statements defined directly in the DAG
sql_task1 = JdbcOperator(
    task_id='sql_cmd',
    jdbc_conn_id='Exasol_db',
    sql=['select current_timestamp;',
         'select current_user from DUAL;',
         'insert into TEST.AIRFLOW values(current_timestamp, current_user, current_session);'],
    autocommit=False,
    params={"db": 'exa_db_61'},
    dag=dag
)

# sql_task2 runs SQL statements stored in external .sql files
sql_task2 = JdbcOperator(
    task_id='sql_cmd_file',
    jdbc_conn_id='Exasol_db',
    sql=['check_events.sql', 'check_usage.sql', 'insert_run.sql'],
    autocommit=False,
    params={"db": 'exa_db_61'},
    dag=dag
)

# sql_task2 runs only after sql_task1 has completed
sql_task2.set_upstream(sql_task1)

As the example shows, you can run SQL statements directly or reference external files with the .sql extension.
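
The sql parameter of the JdbcOperator is also templated with Jinja, so both variants can use Airflow macros and the params dictionary passed to the task. The following sketch adds a third task to the DAG above; the target table TEST.AIRFLOW_RUNS is only an illustration:

# A templated statement: {{ ds }} is the execution date and
# {{ params.db }} comes from the params dictionary of the task
sql_task3 = JdbcOperator(
    task_id='sql_cmd_templated',
    jdbc_conn_id='Exasol_db',
    sql=["insert into TEST.AIRFLOW_RUNS values ('{{ ds }}', '{{ params.db }}');"],
    autocommit=False,
    params={"db": 'exa_db_61'},
    dag=dag
)
sql_task3.set_upstream(sql_task2)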

Additional Information