v.25.12 Improvement
Ngrams tokenizer can now be built with ngram_length = 1
Ngrams tokenizer can now be built with ngram_length = 1. #91529 (George Larionov).
Why it matters
Setting ngram_length to 1 makes the ngrams tokenizer emit unigrams, that is, single-character tokens. This allows finer-grained tokenization for text analysis, especially for languages or applications where single-character tokens are meaningful.
How to use it
When creating or configuring an ngrams tokenizer, set the parameter ngram_length to 1. For example:
ngrams_tokenizer = {
    type = 'ngram',
    ngram_length = 1
}
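
To illustrate what a length of 1 means in practice, here is a minimal Python sketch of character n-gram tokenization. It is not the actual tokenizer implementation; the function name char_ngrams is hypothetical, and the sketch assumes that tokens are contiguous runs of characters of the given length.

```python
def char_ngrams(text: str, ngram_length: int) -> list[str]:
    """Return every contiguous run of `ngram_length` characters in `text`.

    Illustrative only: an assumption about how an ngrams tokenizer behaves,
    not the real implementation.
    """
    if ngram_length < 1:
        raise ValueError("ngram_length must be at least 1")
    # A text shorter than ngram_length yields no tokens (empty range).
    return [
        text[i:i + ngram_length]
        for i in range(len(text) - ngram_length + 1)
    ]


print(char_ngrams("abcd", 3))  # ['abc', 'bcd']
print(char_ngrams("abcd", 1))  # ['a', 'b', 'c', 'd']
```

With ngram_length = 1, every character becomes its own token, so even a one-character search term has a token to match against, whereas a length of 3 produces no tokens for inputs shorter than three characters.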