Validate GPU setup

This article explains how to validate an installation of NVIDIA GPU drivers on Exasol hosts.

After installing the GPU drivers, you can perform a basic validation of the installation using NVIDIA command-line tools. This article explains how to use the NVIDIA tools and what to do if validation fails.

Driver setup validation

Check that the driver version (major.minor.patch) matches the expected version, that the NVIDIA persistence service (persistence mode) is enabled, and that all expected GPUs are listed, which indicates they are correctly installed.

The following nvidia-smi call lists the available GPUs with their name, driver version, and persistence mode status:

$ nvidia-smi --query-gpu=name,driver_version,index,persistence_mode --format=csv

name, driver_version, index, persistence_mode 
Tesla T4, 535.247.01, 0, Enabled
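If you validate many hosts, a small shell check can confirm these values automatically. The following sketch parses one line of nvidia-smi CSV output; the helper name check_gpu_line, the expected driver version, and the sample input line are assumptions for illustration. On a live host you would feed in real output from nvidia-smi --query-gpu=... --format=csv,noheader.

```shell
#!/bin/sh
# Sketch (not an official Exasol tool): check one nvidia-smi CSV line for the
# expected driver version and an enabled persistence mode.
EXPECTED_DRIVER="535.247.01"   # assumption: the version you installed

check_gpu_line() {
  # $1 is one CSV line of: name, driver_version, index, persistence_mode
  driver=$(echo "$1" | cut -d, -f2 | xargs)   # xargs trims whitespace
  index=$(echo "$1" | cut -d, -f3 | xargs)
  pmode=$(echo "$1" | cut -d, -f4 | xargs)
  if [ "$driver" != "$EXPECTED_DRIVER" ]; then
    echo "GPU $index: unexpected driver $driver"
  elif [ "$pmode" != "Enabled" ]; then
    echo "GPU $index: persistence mode is $pmode"
  else
    echo "GPU $index: OK"
  fi
}

# Sample line from the output above; on a live host, pipe real output instead:
# nvidia-smi --query-gpu=name,driver_version,index,persistence_mode \
#   --format=csv,noheader | while read -r line; do check_gpu_line "$line"; done
check_gpu_line "Tesla T4, 535.247.01, 0, Enabled"   # prints: GPU 0: OK
```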

Container tooling validation

Check that the nvidia-container-cli tool is correctly installed and recognizes the installed GPUs.

The following nvidia-container-cli call shows the current driver version and lists all available GPUs:

$ nvidia-container-cli info 
NVRM version:   535.247.01 
CUDA version:   12.2  
Device Index:   0 
Device Minor:   0 
Model:          Tesla T4 
Brand:          Nvidia 
GPU UUID:       GPU-43aae450-f01d-5854-12d2-7ae261f27920 
Bus Location:   00000000:00:1e.0 
Architecture:   7.5
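The NVRM version reported by nvidia-container-cli should match the driver version reported by nvidia-smi. A shell sketch of that cross-check follows; the helper name check_nvrm_match and the sample strings are assumptions standing in for live command output.

```shell
#!/bin/sh
# Sketch: confirm nvidia-container-cli and nvidia-smi agree on the driver version.
check_nvrm_match() {
  # $1 = output of 'nvidia-container-cli info', $2 = driver version from nvidia-smi
  nvrm_ver=$(echo "$1" | awk '/NVRM version:/ {print $3}')
  if [ "$nvrm_ver" = "$2" ]; then
    echo "versions match: $nvrm_ver"
  else
    echo "mismatch: nvidia-container-cli reports $nvrm_ver, nvidia-smi reports $2"
  fi
}

# Stand-in values from the examples above; on a live host you would use
# "$(nvidia-container-cli info)" and the driver_version field from nvidia-smi.
check_nvrm_match "NVRM version:   535.247.01" "535.247.01"
```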

Verify that GPU acceleration is available

This step requires that the Exasol database is up and running.

To verify that GPU acceleration is active and usable you can query the EXA_METADATA system table.

SELECT PARAM_NAME, PARAM_VALUE FROM EXA_METADATA WHERE PARAM_NAME = 'acceleratorDeviceDetected';

The query returns output similar to the following:

PARAM_NAME                 PARAM_VALUE  
-------------------------  -----------  
acceleratorDeviceDetected  1            

The value acceleratorDeviceDetected = 1 means that accelerator devices (in this case, GPUs) are detected and ready to use.
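In an automated health check you can evaluate this value in shell. The sketch below is an assumption for illustration: the helper name check_accelerator is hypothetical, and the commented exaplus invocation stands in for whichever SQL client you use to fetch PARAM_VALUE.

```shell
#!/bin/sh
# Sketch: evaluate the acceleratorDeviceDetected value in a health-check script.
# Fetch the value with your SQL client, for example (assumption, adapt as needed):
# param_value=$(exaplus -c <host>:8563 -u <user> -p <password> -sql \
#   "select PARAM_VALUE from EXA_METADATA where PARAM_NAME = 'acceleratorDeviceDetected';")
check_accelerator() {
  # $1 = PARAM_VALUE returned by the query
  if [ "$1" = "1" ]; then
    echo "GPU acceleration available"
  else
    echo "no accelerator device detected"
  fi
}

check_accelerator "1"   # prints: GPU acceleration available
```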

Driver setup troubleshooting

Error:

The nvidia-smi tool fails with the following output:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Cause:

The driver kernel modules are not correctly built or installed.

Solution:

Validate the DKMS status and reinstall the DKMS driver kernel modules if required.

  1. Check the DKMS kernel module status using the commands dkms status (get DKMS module status) and uname -r (get current kernel version).

    Example:
    $ uname -r
    5.14.0-611.5.1.el9_7.x86_64
    $ dkms status
    nvidia/535.274.02, 5.14.0-611.5.1.el9_7.x86_64, x86_64: installed
  2. Check that the kernel module nvidia/<FULL-DRIVER-VERSION> is shown as installed for the expected driver and the current kernel version.

  3. If the expected DKMS module is not listed as installed for the current kernel, run the installation explicitly using dkms autoinstall.

    $ sudo dkms autoinstall
  4. After installing the DKMS kernel modules, restart the system.
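The check in steps 1 and 2 can be scripted. The following sketch matches the dkms status output against the running kernel version; the helper name check_dkms is hypothetical, and the sample arguments stand in for the live output of dkms status and uname -r.

```shell
#!/bin/sh
# Sketch: check whether 'dkms status' reports an NVIDIA module as installed
# for the running kernel.
check_dkms() {
  # $1 = output of 'dkms status', $2 = kernel version from 'uname -r'
  if echo "$1" | grep -q "^nvidia/.*, $2, .*: installed"; then
    echo "NVIDIA DKMS module installed for $2"
  else
    echo "module missing for $2; run 'sudo dkms autoinstall' and reboot"
  fi
}

# Stand-in values from the example above; on a live host use:
# check_dkms "$(dkms status)" "$(uname -r)"
check_dkms "nvidia/535.274.02, 5.14.0-611.5.1.el9_7.x86_64, x86_64: installed" \
           "5.14.0-611.5.1.el9_7.x86_64"
```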

If this does not resolve the error:

If the error persists after validating or reinstalling the kernel modules and restarting the system, reinstall the GPU drivers as a last resort.

  1. Make sure that the host system is updated with the latest OS kernel and packages.

  2. Restart the system.

  3. To reinstall the drivers, follow the instructions in Install NVIDIA GPU driver on Ubuntu or Install NVIDIA GPU driver on Red Hat Enterprise Linux carefully. Make sure that you install the correct kernel header and development packages for the Linux kernel version.

For more DKMS-related troubleshooting advice, see Frequently Asked Questions – NVIDIA Driver Installation Guide in the NVIDIA documentation.