MLflow model serving

Learn how to serve models as HTTP endpoints using the Exasol MLflow Server.

The Exasol MLflow Server lets you deploy Hugging Face models as HTTP endpoints and call them from Exasol UDFs or external applications. It provides an MLflow-compatible serving layer that sits between your database and your models, so you can run inference over HTTP without embedding model execution directly in UDF code.

How it works

The MLflow server loads a Hugging Face model and exposes it through an HTTP API. Your Exasol UDFs send inference requests to this endpoint and receive predictions back as HTTP responses. External applications outside of Exasol can also call the same endpoint.

This architecture separates model hosting from database execution. The model runs in its own process with dedicated resources (CPU or GPU), while Exasol handles the data retrieval and result processing. You can scale the model server independently of your database cluster.
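
For illustration, the exchange can be driven from any HTTP client, for example a standalone Python script outside the database. The endpoint path and payload shape below are assumptions based on the /invocations pattern used later in this article; check the server's API documentation for the exact format.

import requests

# Hypothetical host and port; replace with the machine running the MLflow server.
response = requests.post(
    "http://<mlflow-server-host>:5000/invocations",
    json={"inputs": ["The delivery was fast and the product works well."]},
    timeout=30,
)
print(response.json())  # for example: {"predictions": [...]}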

When to use MLflow model serving

Exasol offers two main approaches for running inference against ML models. Each fits different requirements.

Approach | How it works | Best for
MLflow model serving | Models run as HTTP endpoints on a separate server | Teams already using MLflow; models that need dedicated GPU resources; serving a model to multiple consumers (Exasol + other apps)
Direct UDF inference (Transformers Extension) | Models run inside Exasol UDFs on the database nodes | Low-latency inference on data already in Exasol; simpler deployment with no external services

Choose MLflow model serving when you want a single model endpoint that multiple systems can call, or when your team already manages models through MLflow's tracking and registry workflow. Choose direct UDF inference when you want the simplest setup and your inference workload runs entirely within Exasol. For a broader comparison of all model connection paths, see the introduction in Connect to AI models.

Prerequisites

  • Python 3 (see Exasol MLflow Server on GitHub for the minimum supported version)
  • An Exasol database instance (version 7.1 or later)
  • Network connectivity between your Exasol cluster and the machine hosting the MLflow server

Set up the MLflow server

The following steps illustrate the general setup pattern. The repository README has the most current installation and configuration instructions.

Install the server

Clone the repository and install the dependencies:

git clone https://github.com/exasol-labs/exasol-labs-mlflow-server.git
cd exasol-labs-mlflow-server
pip install -r requirements.txt

Start the server

python -m exasol_mlflow_server --model <model-name> --host 0.0.0.0 --port 5000

Replace <model-name> with the Hugging Face model identifier you want to serve (for example, distilbert-base-uncased). For the full list of available startup flags, see Exasol MLflow Server on GitHub.
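
Before wiring up a UDF, you can confirm that the server's port is reachable from the network where your Exasol nodes run. The snippet below is a minimal check, assuming the default port 5000 from the startup command above; the host placeholder is hypothetical.

import socket

# Replace <mlflow-server-host> with the machine running the MLflow server.
with socket.create_connection(("<mlflow-server-host>", 5000), timeout=5):
    print("MLflow server port is reachable")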

Call the endpoint from Exasol

Once the server is running, you can call it from a Python UDF in Exasol using HTTP requests. The following example illustrates the general pattern; the actual request format and endpoint path depend on the server's API.

CREATE OR REPLACE PYTHON3 SCALAR SCRIPT MY_SCHEMA.PREDICT_VIA_MLFLOW(input_text VARCHAR(2000))
RETURNS VARCHAR(2000) AS
import requests

def run(ctx):
    # Send the row's text to the MLflow server and return the prediction.
    response = requests.post(
        'http://<mlflow-server-host>:5000/invocations',
        json={"inputs": [ctx.input_text]},
        headers={"Content-Type": "application/json"},
        timeout=30
    )
    response.raise_for_status()
    # Predictions may be strings, numbers, or dicts depending on the model,
    # so convert to a string to match the declared return type.
    return str(response.json()["predictions"][0])
/

Replace <mlflow-server-host> with the hostname or IP address of the machine running the MLflow server.

You can then call this UDF from SQL:

SELECT MY_SCHEMA.PREDICT_VIA_MLFLOW(text_column)
FROM MY_SCHEMA.MY_TABLE;

Architecture considerations

Resource isolation. Because the model runs on a separate server, you avoid consuming CPU and memory on the Exasol database nodes for inference. This matters for large models or high-throughput inference workloads.

Network latency. Every inference call is an HTTP round trip. For bulk inference on millions of rows, the network overhead can add up. If latency is a concern, consider the Transformers Extension for in-database inference instead.
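
One way to reduce the number of round trips is to batch rows before calling the endpoint. The following SET script is a minimal sketch of that pattern; the script name, column names, and batch size are illustrative, and it assumes the server accepts multiple items in the "inputs" list and returns predictions in the same order.

CREATE OR REPLACE PYTHON3 SET SCRIPT MY_SCHEMA.PREDICT_BATCH(row_id DECIMAL(18,0), input_text VARCHAR(2000))
EMITS (row_id DECIMAL(18,0), prediction VARCHAR(2000)) AS
import requests

BATCH_SIZE = 100

def predict(ids, texts, ctx):
    # One HTTP request for the whole batch instead of one per row.
    response = requests.post(
        'http://<mlflow-server-host>:5000/invocations',
        json={"inputs": texts},
        headers={"Content-Type": "application/json"},
        timeout=60
    )
    response.raise_for_status()
    for rid, pred in zip(ids, response.json()["predictions"]):
        ctx.emit(rid, str(pred))

def run(ctx):
    ids, texts = [], []
    while True:
        ids.append(ctx.row_id)
        texts.append(ctx.input_text)
        if len(ids) == BATCH_SIZE:
            predict(ids, texts, ctx)
            ids, texts = [], []
        if not ctx.next():
            break
    if ids:
        predict(ids, texts, ctx)
/

Called without a GROUP BY, a SET script processes all rows as a single group on one node; grouping by a bucketing column spreads the batches across nodes.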

Model lifecycle. MLflow provides built-in model versioning, experiment tracking, and a model registry. If your team already uses these features, the MLflow server integrates naturally into that workflow.
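
As a rough illustration of that workflow, independent of the Exasol MLflow Server itself, the standard MLflow client can list the registered versions of a model. The tracking server URL and model name below are hypothetical.

from mlflow.tracking import MlflowClient

# Hypothetical tracking server and model name.
client = MlflowClient(tracking_uri="http://<mlflow-tracking-server>:8080")
for version in client.search_model_versions("name='text-classifier'"):
    print(version.name, version.version)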

Further reading