v.25.12 Improvement

Ngrams tokenizer can now be built with ngram_length = 1

Published: November 25, 2025

Ngrams tokenizer can now be built with ngram_length = 1. #91529 (George Larionov).

The ngrams tokenizer in ClickHouse can now be configured with ngram_length = 1, allowing tokenization into single-character ngrams.
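To make "single-character ngrams" concrete, here is a minimal Python sketch of character n-gram tokenization. It is only an illustration of the general technique, not ClickHouse's implementation: the function name char_ngrams is hypothetical, and it splits on Python characters (Unicode code points), which may differ from how ClickHouse handles bytes and encodings.

```python
def char_ngrams(text: str, ngram_length: int) -> list[str]:
    """Return every contiguous substring of `ngram_length` characters.

    Hypothetical helper for illustration only; it is not a ClickHouse API.
    """
    if ngram_length < 1:
        raise ValueError("ngram_length must be at least 1")
    # Slide a window of ngram_length characters across the text.
    # If the text is shorter than the window, the range is empty.
    return [
        text[i:i + ngram_length]
        for i in range(len(text) - ngram_length + 1)
    ]


if __name__ == "__main__":
    print(char_ngrams("日本語", 1))  # each character becomes its own token
    print(char_ngrams("ok", 3))      # input shorter than 3 characters: no tokens
    print(char_ngrams("ok", 1))      # length 1 still yields tokens for short input
```

In this sketch, a length of 1 yields one token per character, whereas a longer length yields nothing for inputs shorter than the window.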

Why it matters

With ngram_length = 1 the tokenizer emits unigrams, so every character becomes its own token. This is the most granular tokenization the ngrams tokenizer can produce, and it is useful for languages or applications where a single character is a meaningful unit of text.

How to use it

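When creating or configuring an ngrams tokenizer, set the ngram_length parameter to 1. The snippet below is schematic rather than literal ClickHouse syntax; consult the ClickHouse documentation for the exact form in your version. For example: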

ngrams_tokenizer = {
    type = 'ngram',
    ngram_length = 1
}

Related resources

Pull request #91529