v.21.11 New Features

Added tokens and ngrams functions for text processing in ClickHouse

Added function tokens, which splits a string into tokens using non-alphanumeric ASCII characters as separators. #29981 (Maksim Kita). Added function ngrams to extract n-grams from text. Closes #29699. #29738 (Maksim Kita).
Two new text-processing functions were introduced: tokens splits a string into tokens on non-alphanumeric ASCII separators, and ngrams extracts n-grams of a given length.

Why it matters

Tokenization and n-gram generation are common building blocks for text analysis, search, and natural language processing. With these functions, both steps can be performed directly in ClickHouse queries.

How to use it

Call tokens with a string to split it into tokens, with non-alphanumeric ASCII characters acting as separators. Call ngrams with a string and the desired n-gram length to extract n-grams from the text.

Example usage:
SELECT tokens('example-text_string');
SELECT ngrams('example text', 2);
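
To make the described behavior concrete, here is a rough Python model of the two functions. It is only a sketch of the semantics stated above, not ClickHouse's implementation. It assumes that every ASCII character other than 0-9, A-Z, and a-z is a separator (so "-" and "_" both split), that empty tokens are dropped, and that ngrams produces sliding character-level n-grams. The helper names simply mirror the SQL functions, and the printed results are the output of this model, not verified server output.

```python
import re

# Assumption: a separator is any ASCII character that is not a digit or an
# ASCII letter. Non-ASCII characters are left inside tokens.
_SEPARATORS = re.compile(r"[\x00-\x2F\x3A-\x40\x5B-\x60\x7B-\x7F]+")


def tokens(text: str) -> list[str]:
    """Model of tokens(): split on runs of separators, dropping empty pieces."""
    return [piece for piece in _SEPARATORS.split(text) if piece]


def ngrams(text: str, n: int) -> list[str]:
    """Model of ngrams(): sliding window of n characters (assumed character-level)."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A string shorter than n yields an empty list.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    print(tokens("example-text_string"))  # ['example', 'text', 'string']
    print(ngrams("example text", 2)[:3])  # ['ex', 'xa', 'am']
```

Under these assumptions the space in 'example text' is an ordinary character for ngrams, so n-grams such as 'e ' and ' t' appear in the result. If word-level n-grams are needed, one option is to tokenize first and build n-grams over the token list.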