v.22.1 Experimental Feature

Implemented Sparse Serialization to Optimize Disk Space and Query Performance

Implemented sparse serialization. It can reduce disk space usage and improve the performance of some queries for columns that contain many default (zero) values. It can be enabled by setting ratio_for_sparse_serialization. Sparse serialization is chosen dynamically for a column if its ratio of default values to all values is above that threshold. The serialization (default or sparse) is fixed for every column within a part, but may vary between parts. #22535 (Anton Popov).
Implemented sparse serialization to reduce disk space usage and improve query performance for columns with many default (zero) values.

Why it matters

This feature addresses the inefficiency of storing and processing columns that contain a high ratio of default values by dynamically selecting a sparse serialization method. It reduces disk space consumption and can accelerate query execution on such columns, benefiting users with sparse data distributions.

How to use it

Enable sparse serialization by setting the ratio_for_sparse_serialization parameter. When the ratio of default values to all values in a column exceeds this threshold, sparse serialization will be used dynamically for that column in a part. Serialization mode is fixed per column per part but can differ between parts.
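
The threshold rule can be illustrated with a small Python sketch. This is only a model of the decision described above, not the server's implementation: the function names (default_ratio, choose_serialization, to_sparse, from_sparse) and the toy sparse layout are hypothetical, and the actual on-disk format may differ.

```python
from typing import List, Sequence, Tuple


def default_ratio(values: Sequence[int], default: int = 0) -> float:
    """Fraction of values in a column part that equal the default value."""
    if not values:
        return 0.0
    return sum(1 for v in values if v == default) / len(values)


def choose_serialization(values: Sequence[int], ratio_threshold: float,
                         default: int = 0) -> str:
    """Pick 'sparse' when the default ratio is above the threshold (hypothetical helper)."""
    return "sparse" if default_ratio(values, default) > ratio_threshold else "default"


def to_sparse(values: Sequence[int], default: int = 0) -> Tuple[int, List[int], List[int]]:
    """Toy sparse form: total length, positions of non-defaults, and their values."""
    positions = [i for i, v in enumerate(values) if v != default]
    return len(values), positions, [values[i] for i in positions]


def from_sparse(length: int, positions: Sequence[int], non_defaults: Sequence[int],
                default: int = 0) -> List[int]:
    """Rebuild the full column from the toy sparse form."""
    out = [default] * length
    for pos, val in zip(positions, non_defaults):
        out[pos] = val
    return out


# The same column in two parts: the decision is made per part.
part_a = [0, 0, 0, 7, 0, 0, 0, 0, 0, 3]  # 8 of 10 values are defaults
part_b = [5, 1, 0, 2, 9, 4, 0, 8, 6, 3]  # 2 of 10 values are defaults

threshold = 0.5  # illustrative value for the ratio setting
mode_a = choose_serialization(part_a, threshold)
mode_b = choose_serialization(part_b, threshold)
```

With an illustrative threshold of 0.5, part_a is mostly defaults and would be stored sparsely, while part_b stays in the default serialization, which mirrors how the mode can differ between parts of the same column.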