v.21.11 New Features
Added tokens and ngrams functions for text processing in ClickHouse
Added function tokens, which splits a string into tokens using non-alphanumeric ASCII characters as separators. #29981 (Maksim Kita). Added function ngrams to extract n-grams from text. Closes #29699. #29738 (Maksim Kita).
Why it matters
These functions let users tokenize text and generate n-grams directly in ClickHouse queries. Both operations are common building blocks for text analysis, search optimization, and natural language processing tasks.
How to use it
Use the tokens function to split a string into tokens, treating non-alphanumeric ASCII characters as separators. Use the ngrams function to extract n-grams from text, passing the desired n-gram length as the second argument.
Example usage:
SELECT tokens('example-text_string')
SELECT ngrams('example text', 2)
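To make the behavior described above concrete, here is a rough Python sketch of what the two functions compute. It is an illustration of the stated semantics only, not ClickHouse's implementation. The names tokens_sketch and ngrams_sketch are hypothetical. The sketch assumes that n-grams are taken over characters and that non-ASCII characters are kept inside tokens; edge cases in ClickHouse itself may differ.

```python
from typing import List


def tokens_sketch(text: str) -> List[str]:
    """Split text into tokens, treating every non-alphanumeric ASCII
    character as a separator (hypothetical helper, illustration only)."""
    result: List[str] = []
    current: List[str] = []
    for ch in text:
        # Assumption: only ASCII characters can act as separators;
        # non-ASCII characters stay inside the current token.
        is_separator = ch.isascii() and not ch.isalnum()
        if is_separator:
            if current:
                result.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        result.append("".join(current))
    return result


def ngrams_sketch(text: str, n: int) -> List[str]:
    """Return all contiguous substrings of length n, sliding one
    character at a time (hypothetical helper, illustration only)."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # The range is empty when the text is shorter than n.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    print(tokens_sketch("example-text_string"))
    print(ngrams_sketch("example text", 2))
```

Under these assumptions, the sketch splits 'example-text_string' into the three tokens example, text, and string, and a 12-character input with n = 2 yields 11 overlapping bigrams, including ones that contain the space.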