v.21.1 Experimental Features
Added minHash and simHash functions for n-grams and shingles for semi-duplicate search
Added functions for calculation of minHash and simHash of text n-grams and shingles. They are intended for semi-duplicate search. Also, the functions bitHammingDistance and tupleHammingDistance are added. #7649 (flynn).
Why it matters
These functions help identify and compare similar or nearly duplicate text data by enabling fast, approximate similarity detection using hashing techniques. This is valuable for users who need to detect duplicates or near-duplicates in large text datasets, improving search efficiency and data quality.
How to use it
Use the newly added minHash and simHash functions on text inputs, split into n-grams or shingles, to generate hash signatures. Then compare the signatures with bitHammingDistance (for integer signatures such as simHash values) or tupleHammingDistance (for tuple signatures such as minHash values). A smaller distance indicates more similar text. These functions can be called in ordinary SELECT statements to perform semi-duplicate detection.
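To make the workflow above concrete, here is a minimal Python sketch of the underlying technique: build a simHash and a minHash signature from character n-grams, then compare signatures with a bit-level and a tuple-level Hamming distance. It is an illustration only and not the ClickHouse implementation. The helper names (`ngrams`, `hash64`, `sim_hash`, `min_hash`, `bit_hamming_distance`, `tuple_hamming_distance`), the choice of BLAKE2b as the hash, the n-gram size of 3, and the signature length `k=4` are all assumptions made for this example.

```python
import hashlib


def ngrams(text: str, n: int = 3) -> list[str]:
    """Split text into overlapping character n-grams (the whole text if it is short)."""
    if len(text) <= n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def hash64(token: str, seed: int = 0) -> int:
    """Stable 64-bit hash of a token. BLAKE2b is an arbitrary choice for this sketch."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "big")
    ).digest()
    return int.from_bytes(digest, "big")


def sim_hash(text: str, n: int = 3) -> int:
    """64-bit simHash: each n-gram votes +1/-1 per bit; positive totals set the bit."""
    votes = [0] * 64
    for gram in ngrams(text, n):
        h = hash64(gram)
        for bit in range(64):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if votes[bit] > 0)


def min_hash(text: str, n: int = 3, k: int = 4) -> tuple[int, ...]:
    """minHash signature: the minimum n-gram hash under each of k seeded hash functions."""
    grams = ngrams(text, n)
    return tuple(min(hash64(g, seed) for g in grams) for seed in range(k))


def bit_hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two integers differ."""
    return bin(a ^ b).count("1")


def tuple_hamming_distance(a: tuple, b: tuple) -> int:
    """Number of positions in which two equal-length tuples differ."""
    if len(a) != len(b):
        raise ValueError("tuples must have the same length")
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    # Hypothetical sample strings; a smaller distance suggests more similar text.
    original = "ClickHouse is a column-oriented database"
    near_dup = "ClickHouse is a column oriented database!"
    print(bit_hamming_distance(sim_hash(original), sim_hash(near_dup)))
    print(tuple_hamming_distance(min_hash(original), min_hash(near_dup)))
```

In ClickHouse itself the same two steps, computing a signature and then a distance, happen inside a query rather than in application code. The exact built-in function names and return types should be checked against the ClickHouse function reference for your version.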