Character Encoding

This article describes character encoding support and sorting rules in Exasol.

Supported character sets and encoding

Exasol supports the Unicode and ASCII character sets. Unicode characters are defined using code points, which in Exasol are transformed to binary values using the UTF-8 character encoding standard (UTF = Unicode Transformation Format). The first 128 code points in Unicode are identical to the ASCII character set, which makes ASCII a subset of Unicode.

Exasol does not support SQL collation.

UTF‑8

Binary encoding of the code point value of Unicode characters, using 1 to 4 bytes per character depending on the value of the code point.

The first 128 code points are identical to the ASCII character set.
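The variable byte length can be illustrated outside the database. The following is a minimal Python sketch, not Exasol SQL, that uses the standard `str.encode` method. The helper name `utf8_length` and the sample characters are assumptions chosen for illustration.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses to encode one character."""
    return len(ch.encode("utf-8"))


# One sample character from each UTF-8 length class (1 to 4 bytes).
samples = ["A", "Ä", "€", "😀"]
lengths = {ch: utf8_length(ch) for ch in samples}

for ch, n in lengths.items():
    # ord() gives the Unicode code point of the character.
    print(f"U+{ord(ch):04X} uses {n} byte(s)")
```

Code points below 128 take a single byte, which is why ASCII text has the same bytes in both encodings.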

ASCII

Binary encoding using 1 byte per character, of which only the lower 7 bits are used.

Extended ASCII (8 bits) is not supported in Exasol.

Examples:

Character                   Decimal value  Unicode code point  UTF-8 binary       ASCII binary (7-bit)
NULL (control character)    0              U+0000              00000000           00000000
SPACE                       32             U+0020              00100000           00100000
A                           65             U+0041              01000001           01000001
a                           97             U+0061              01100001           01100001
DELETE (control character)  127            U+007F              01111111           01111111
Ä                           196            U+00C4              11000011 10000100  -
ä                           228            U+00E4              11000011 10100100  -
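The binary values in these examples can be reproduced with a short script. This is a hedged Python sketch, not part of Exasol. The function names `utf8_bits` and `is_7bit_ascii` are hypothetical helpers introduced only for this illustration.

```python
def utf8_bits(ch: str) -> str:
    """Return the UTF-8 encoding of a character as space-separated 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in ch.encode("utf-8"))


def is_7bit_ascii(ch: str) -> bool:
    """True if the character's code point fits in the 7-bit ASCII range (0-127)."""
    return ord(ch) < 128


for ch in ["\x00", " ", "A", "a", "\x7f", "Ä", "ä"]:
    # Characters outside the 7-bit range have no ASCII encoding.
    ascii_column = utf8_bits(ch) if is_7bit_ascii(ch) else "-"
    print(f"{ord(ch):>3}  U+{ord(ch):04X}  {utf8_bits(ch):<17}  {ascii_column}")
```

For code points 0 to 127 the two binary columns are identical. From code point 128 upward, UTF-8 switches to multiple bytes and there is no ASCII value.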

Sorting behavior

When strings are compared explicitly using the operators <, >, <=, and >=, or implicitly using ORDER BY, the sort order is based on the binary values of the characters. This results in the following sorting behavior:

  • Sorting is case-sensitive. Uppercase characters are sorted before lowercase characters in ascending order because their binary values are lower (for example, Z is 90 and a is 97). As a result, aardvark comes after Zebra when sorting in ascending order.

  • Sorting is accent-sensitive. Characters with accents (diacritics) are compared and sorted according to their UTF-8 binary representation. For example, Åhus is sorted after both Axberg and Zoo in ascending order.
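The ordering described above can be modeled by sorting on the raw UTF-8 bytes of each string. The following Python sketch only approximates the described behavior and is not Exasol's implementation. The word list is taken from the examples in this section.

```python
# Words from the examples above, in arbitrary order.
words = ["aardvark", "Zebra", "Åhus", "Axberg", "Zoo"]

# Compare strings by their UTF-8 byte sequences, which models a binary sort.
ordered = sorted(words, key=lambda s: s.encode("utf-8"))

for word in ordered:
    # Show the leading byte in hex to make the ordering visible.
    print(f"{word:<10} first byte: 0x{word.encode('utf-8')[0]:02X}")
```

In this model the uppercase words come first (A is 0x41, Z is 0x5A), then aardvark (a is 0x61), and Åhus comes last because Å is encoded as two bytes that start with 0xC3.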