v.22.2 New Feature

Add Custom Deduplication Semantic Setting in MergeTree/ReplicatedMergeTree

Add a setting that allows a user to provide their own deduplication semantics in MergeTree/ReplicatedMergeTree. If provided, it is used instead of the data digest to generate the block ID. So, for example, by providing a unique value for the setting in each INSERT statement, the user can prevent identical inserted data from being deduplicated. This closes: #7461. #32304 (Igor Nikonov).
Introduces a setting for MergeTree and ReplicatedMergeTree tables that lets users define custom deduplication semantics: the value they supply replaces the default data digest as the input for block ID generation.
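The effect of swapping the data digest for a user-supplied value can be illustrated with a small toy model. This is only a sketch of the idea described above, not ClickHouse's implementation: the names `block_id` and `ToyTable`, the use of SHA-256, and the in-memory set of seen block IDs are all assumptions made for illustration.

```python
import hashlib


def block_id(data: bytes, token: str = "") -> str:
    """Toy block ID: hash the token when one is given, otherwise hash the data."""
    source = token.encode("utf-8") if token else data
    return hashlib.sha256(source).hexdigest()


class ToyTable:
    """Hypothetical in-memory stand-in for a table with insert deduplication."""

    def __init__(self):
        self.seen_block_ids = set()
        self.rows = []

    def insert(self, rows, token=""):
        """Return True if the block was written, False if it was deduplicated."""
        data = repr(rows).encode("utf-8")
        bid = block_id(data, token)
        if bid in self.seen_block_ids:
            return False  # same block ID seen before: the insert is skipped
        self.seen_block_ids.add(bid)
        self.rows.extend(rows)
        return True
```

In this model, repeating an insert without a token is skipped because the data hashes to the same ID, while repeating it with a fresh token is written. Conversely, reusing a token with different data is skipped, since the token alone determines the ID.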

Why it matters

This feature enables users to control deduplication behavior more precisely during data insertion. By specifying a unique value for the custom setting in each INSERT statement, users can prevent identical data blocks from being deduplicated, which is useful when intentional duplicate inserts need to be preserved.

How to use it

To use this feature, set the new deduplication setting to a value that identifies each inserted block. For example, supply a unique identifier with each INSERT query so that ClickHouse treats every block as distinct and does not deduplicate it based on its data content. Conversely, reusing the same value across inserts causes the later blocks to be treated as duplicates, even if their data differs.
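A minimal sketch of attaching a fresh value to each INSERT might look like the following. The entry above does not name the setting; the name `insert_deduplication_token` used here is an assumption based on ClickHouse's settings documentation and should be checked against your server version. The helper `insert_with_unique_token` and the table name `events` are hypothetical.

```python
import uuid


def insert_with_unique_token(table: str, values_sql: str) -> str:
    """Build an INSERT statement carrying a per-statement unique token.

    Assumption: the setting is named insert_deduplication_token.
    The table name and VALUES text are trusted inputs in this sketch.
    """
    token = uuid.uuid4().hex  # random hex string, different on every call
    return (
        f"INSERT INTO {table} "
        f"SETTINGS insert_deduplication_token = '{token}' "
        f"VALUES {values_sql}"
    )


# Two statements with identical data but different tokens.
first = insert_with_unique_token("events", "(1, 'click')")
second = insert_with_unique_token("events", "(1, 'click')")
```

Because each statement carries a different token, the two inserts should be stored as separate blocks rather than being collapsed by deduplication.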