v.25.12 Improvement

Ngrams tokenizer can now be built with ngram_length = 1

Ngrams tokenizer can now be built with ngram_length = 1. #91529 (George Larionov).
The ngrams tokenizer in ClickHouse can now be configured with ngram_length = 1, allowing tokenization into single-character ngrams.

Why it matters

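With ngram_length = 1, every character becomes its own token (a unigram), which is the finest granularity an ngrams tokenizer can produce. This helps text analysis where single characters carry meaning on their own, for example in languages such as Chinese or Japanese that are written without spaces between words.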

How to use it

When creating or configuring an ngrams tokenizer, set the ngram_length parameter to 1. Schematically:

ngrams_tokenizer = {
    type = 'ngram',
    ngram_length = 1
}
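
To make the effect of ngram_length = 1 concrete, here is a minimal Python sketch of character n-gram splitting. It is an illustration only, not ClickHouse's implementation: the function name ngrams is hypothetical, and the real tokenizer may treat whitespace, punctuation, or Unicode differently.

```python
def ngrams(text: str, n: int) -> list[str]:
    """Return every contiguous run of n characters in text (illustrative only)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n across the string, one character at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# With n = 1 each character is its own token (a unigram).
print(ngrams("東京都", 1))  # ['東', '京', '都']
# With n = 2 the tokens are overlapping character pairs.
print(ngrams("東京都", 2))  # ['東京', '京都']
```

In this sketch a length of 1 yields one token per character, while larger lengths yield overlapping windows and return nothing for inputs shorter than the window.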