Validate GPU setup
This article explains how to validate an installation of NVIDIA GPU drivers on Exasol hosts.
After installing the GPU drivers, you can perform a basic validation of the installation using the NVIDIA command-line tools. This article explains how to use these tools and what to do if validation fails.
Driver setup validation
Check that the driver version (major.minor.patch) matches the expected version, that the NVIDIA persistence service (persistence mode) is enabled, and that all expected GPUs are listed, which confirms they are correctly installed.
The following nvidia-smi call shows the available GPU list with name, driver version and persistence mode status:
$ nvidia-smi --query-gpu=name,driver_version,index,persistence_mode --format=csv
name, driver_version, index, persistence_mode
Tesla T4, 535.247.01, 0, Enabled
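If you run this check regularly, the CSV output can also be validated programmatically. The following sketch (a minimal example; the helper names and the expected driver version are illustrative, not part of the NVIDIA tooling) parses the `--format=csv` output shown above:

```python
import csv
import io

def parse_gpu_csv(output: str) -> list:
    """Parse `nvidia-smi --query-gpu=... --format=csv` output into dicts."""
    reader = csv.DictReader(io.StringIO(output), skipinitialspace=True)
    return list(reader)

def validate(gpus: list, expected_driver: str, expected_count: int) -> bool:
    """Check driver version, persistence mode, and GPU count."""
    return (
        len(gpus) == expected_count
        and all(g["driver_version"] == expected_driver for g in gpus)
        and all(g["persistence_mode"] == "Enabled" for g in gpus)
    )

# Sample output taken from the article
sample = """name, driver_version, index, persistence_mode
Tesla T4, 535.247.01, 0, Enabled
"""
gpus = parse_gpu_csv(sample)
print(validate(gpus, "535.247.01", 1))  # prints True
```

In a real deployment you would feed the function the live output, for example via `subprocess.run(["nvidia-smi", ...], capture_output=True)`.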
Container tooling validation
Check that the nvidia-container-cli tool is correctly installed and that it recognizes all installed GPUs.
The following nvidia-container-cli call shows the current driver version and lists all available GPUs:
$ nvidia-container-cli info
NVRM version: 535.247.01
CUDA version: 12.2
Device Index: 0
Device Minor: 0
Model: Tesla T4
Brand: Nvidia
GPU UUID: GPU-43aae450-f01d-5854-12d2-7ae261f27920
Bus Location: 00000000:00:1e.0
Architecture: 7.5
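The NVRM version reported here must match the driver version reported by nvidia-smi, so the two checks can be cross-validated. A minimal parsing sketch (the helper name is illustrative; repeated keys in multi-GPU output are not handled):

```python
def parse_cli_info(output: str) -> dict:
    """Parse `nvidia-container-cli info` key/value lines into a dict.
    Splits on the first colon only, so values such as bus locations
    (00000000:00:1e.0) are kept intact."""
    info = {}
    for line in output.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            info[key.strip()] = value.strip()
    return info

# Sample output taken from the article
sample = """NVRM version: 535.247.01
CUDA version: 12.2
Device Index: 0
Model: Tesla T4
Bus Location: 00000000:00:1e.0
"""
info = parse_cli_info(sample)
# Cross-check against the driver version reported by nvidia-smi
print(info["NVRM version"] == "535.247.01")  # prints True
```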
Verify that GPU acceleration is available
This step requires that the Exasol database is up and running.
To verify that GPU acceleration is active and usable you can query the EXA_METADATA system table.
select PARAM_NAME, PARAM_VALUE from EXA_METADATA where PARAM_NAME = 'acceleratorDeviceDetected';
The query returns the following output:
PARAM_NAME PARAM_VALUE
------------------------- -----------
acceleratorDeviceDetected 1
The value acceleratorDeviceDetected = 1 means that accelerator devices (in this case, GPUs) are detected and ready to use.
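This check can be automated from a client application. The sketch below shows the interpretation of the returned value, with a hedged usage example using the pyexasol driver (the connection parameters are placeholders; verify them for your environment):

```python
def accelerator_detected(param_value: str) -> bool:
    """Interpret the acceleratorDeviceDetected value from EXA_METADATA."""
    return param_value == "1"

# Example usage with pyexasol (untested sketch; dsn and credentials
# are placeholders, not values from this article):
# import pyexasol
# conn = pyexasol.connect(dsn="<host>:8563", user="sys", password="<password>")
# row = conn.execute(
#     "select PARAM_VALUE from EXA_METADATA "
#     "where PARAM_NAME = 'acceleratorDeviceDetected'"
# ).fetchone()
# print(accelerator_detected(row[0]))

print(accelerator_detected("1"))  # prints True (GPUs detected)
print(accelerator_detected("0"))  # prints False (no accelerator devices)
```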
Driver setup troubleshooting
Error:
The nvidia-smi tool fails with the following output:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Cause:
The driver kernel modules are not correctly built and/or installed.
Solution:
Validate the DKMS status and reinstall the DKMS driver kernel modules if required.
1. Check the DKMS kernel module status using the commands dkms status (get DKMS module status) and uname -r (get current kernel version). Example:

$ uname -r
5.14.0-611.5.1.el9_7.x86_64
$ dkms status
nvidia/535.274.02, 5.14.0-611.5.1.el9_7.x86_64, x86_64: installed

2. Check that the kernel module nvidia/<FULL-DRIVER-VERSION> is shown as installed for the expected driver and the current kernel version.

3. If the expected DKMS module is not shown as installed, or if it is missing, run the installation explicitly using dkms autoinstall.

$ sudo dkms autoinstall

4. After installing the DKMS kernel modules, restart the system.
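The DKMS check above can also be scripted by comparing the dkms status output against the running kernel. A minimal sketch (the function name is illustrative; the sample strings are taken from the output above):

```python
def nvidia_module_installed(dkms_output: str, kernel: str) -> bool:
    """Return True if an nvidia DKMS module is reported as installed
    for the given kernel version in `dkms status` output."""
    for line in dkms_output.splitlines():
        # Expected line format: nvidia/<version>, <kernel>, <arch>: installed
        parts = [p.strip() for p in line.split(",")]
        if (
            len(parts) >= 3
            and parts[0].startswith("nvidia/")
            and parts[1] == kernel
            and parts[2].endswith(": installed")
        ):
            return True
    return False

status = "nvidia/535.274.02, 5.14.0-611.5.1.el9_7.x86_64, x86_64: installed"
print(nvidia_module_installed(status, "5.14.0-611.5.1.el9_7.x86_64"))  # prints True
```

In practice you would pass in the live output of `dkms status` and `uname -r`, for example collected with `subprocess.run`.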
If this does not resolve the error:
If the error remains after validating and/or reinstalling the kernel modules and restarting the system, a last resort is to reinstall the GPU drivers.
1. Make sure that the host system is updated with the latest OS kernel and packages.

2. Restart the system.

3. To reinstall the drivers, follow the instructions in Install NVIDIA GPU driver on Ubuntu or Install NVIDIA GPU driver on Red Hat Enterprise Linux carefully. Make sure that you install the correct kernel header and development packages for the Linux kernel version.
For more DKMS-related troubleshooting advice, see Frequently Asked Questions – NVIDIA Driver Installation Guide in the NVIDIA documentation.