v.21.1 Experimental Features

Added minHash and simHash functions for n-grams and shingles for semi-duplicate search

Added functions that compute minHash and simHash signatures of text n-grams and shingles, intended for efficient semi-duplicate search. The functions bitHammingDistance and tupleHammingDistance were added as well, for measuring the distance between the resulting hashes. #7649 (flynn).
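
As a rough illustration of the idea behind simHash and bitHammingDistance, here is a minimal Python sketch. It is not ClickHouse's implementation: the helper names (`ngrams`, `hash64`, `sim_hash`, `bit_hamming_distance`), the use of BLAKE2b as the token hash, and the 3-character n-gram size are all assumptions made for illustration, so the values it produces will not match those returned by the database.

```python
import hashlib


def ngrams(text: str, n: int = 3) -> list[str]:
    """Return the overlapping character n-grams of text (hypothetical helper)."""
    # A text shorter than n yields a single, shorter gram.
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]


def hash64(token: str) -> int:
    """Hash a token to a 64-bit integer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sim_hash(text: str, n: int = 3) -> int:
    """Classic SimHash: every n-gram votes +1 or -1 on each of the 64 bits."""
    votes = [0] * 64
    for gram in ngrams(text, n):
        h = hash64(gram)
        for bit in range(64):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    # A bit is set in the result when the majority of n-grams had it set.
    return sum(1 << bit for bit in range(64) if votes[bit] > 0)


def bit_hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two non-negative integers differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    original = sim_hash("ClickHouse is a column-oriented database")
    edited = sim_hash("ClickHouse is a column oriented database")
    print(bit_hamming_distance(original, edited))
```

Because similar texts share most of their n-grams, most bit votes come out the same way, so near-duplicates tend to produce hashes that differ in only a few bits.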

Why it matters

These functions make it possible to find similar or nearly duplicate text without comparing documents character by character. Each text is reduced to a compact hash signature, and similar texts produce signatures that are close to each other, so similarity can be estimated quickly and approximately. This is valuable for detecting duplicates and near-duplicates in large text datasets, where exact comparison of every pair of texts is too expensive.

How to use it

Apply the minHash or simHash functions to a text value to produce a hash signature from its n-grams or shingles. Then pass two signatures to bitHammingDistance (for integer hashes) or tupleHammingDistance (for tuples of hashes) to get the number of positions in which they differ; a smaller distance indicates more similar texts. The functions can be called directly in SELECT statements, for example to compare the signature of each row against the signature of a reference text.
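
The workflow above (compute a signature per text, then compare signatures) can be sketched in Python for the minHash and tupleHammingDistance pair. This is a generic MinHash illustration under stated assumptions, not the ClickHouse implementation: the names `word_shingles`, `min_hash`, and `tuple_hamming_distance`, the signature length of 16, the shingle size of 2 words, and the salted BLAKE2b hash are all hypothetical choices, and the tuple shape returned by the database functions may differ.

```python
import hashlib


def word_shingles(text: str, size: int = 2) -> list[str]:
    """Return overlapping runs of `size` lowercased words (hypothetical helper)."""
    words = text.lower().split()
    if len(words) < size:
        return [" ".join(words)]
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def min_hash(text: str, num_hashes: int = 16, size: int = 2) -> tuple[int, ...]:
    """Generic MinHash signature: the smallest hash per salted hash function."""
    shingles = word_shingles(text, size)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingles
        ))
    return tuple(signature)


def tuple_hamming_distance(a: tuple, b: tuple) -> int:
    """Number of positions at which two equal-length tuples differ."""
    if len(a) != len(b):
        raise ValueError("tuples must have the same length")
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    reference = min_hash("the quick brown fox jumps over the lazy dog")
    candidates = [
        "the quick brown fox jumps over the lazy cat",
        "an entirely unrelated sentence about databases",
    ]
    # Rank candidates by distance to the reference; smaller means more similar.
    for text in sorted(candidates, key=lambda t: tuple_hamming_distance(reference, min_hash(t))):
        print(tuple_hamming_distance(reference, min_hash(text)), text)
```

In a database query the same pattern would typically be a SELECT that computes the distance between each row's signature and a reference signature, then filters or orders by that distance.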