v.21.12 New Features

Major Enhancements to Data Format Completeness and Consistency

TLDR: Major improvements to the completeness and consistency of text formats. Refactor the TSV, TSVRaw, CSV, JSONCompactEachRow and JSONCompactStringsEachRow formats, remove code duplication, and add a base interface for formats with the -WithNames and -WithNamesAndTypes suffixes. Add formats CSVWithNamesAndTypes, TSVRawWithNames, TSVRawWithNamesAndTypes, JSONCompactEachRowWithNames, JSONCompactStringsEachRowWithNames, RowBinaryWithNames. Support parallel parsing for formats TSVWithNamesAndTypes, TSVRaw(WithNames/WithNamesAndTypes), CSVWithNamesAndTypes, JSONCompactEachRow(WithNames/WithNamesAndTypes), JSONCompactStringsEachRow(WithNames/WithNamesAndTypes). Support column mapping and type checking for the RowBinaryWithNamesAndTypes format. Add setting input_format_with_types_use_header, which specifies whether to check that the types written in a <format_name>WithNamesAndTypes format match the table structure. Add setting input_format_csv_empty_as_default and use it in the CSV format instead of input_format_defaults_for_omitted_fields (because that setting should not control csv_empty_as_default). Fix usage of setting input_format_defaults_for_omitted_fields (it was used only as csv_empty_as_default, but it should control calculation of default expressions for omitted fields). Fix Nullable input/output in the TSVRaw format and make this format fully compatible with inserting into TSV. Fix inserting NULLs into LowCardinality(Nullable) when input_format_null_as_default is enabled (previously default values were inserted instead of actual NULLs). Fix string deserialization in the JSONStringsEachRow/JSONCompactStringsEachRow formats (strings were parsed only up to the first '\n' or '\t'). Add the ability to use the Raw escaping rule in the Template input format. Add diagnostic info for the JSONCompactEachRow(WithNames/WithNamesAndTypes) input format. Fix a bug with parallel parsing of -WithNames formats when the setting min_chunk_bytes_for_parallel_parsing is less than the number of bytes in a single row. #30178 (Kruglov Pavel).
Allow printing/parsing names and types of columns in the CustomSeparated input/output format. Add formats CustomSeparatedWithNames/WithNamesAndTypes, similar to TSVWithNames/WithNamesAndTypes. #31434 (Kruglov Pavel).
Major improvements and refactoring of text and binary data formats in ClickHouse, including new variants with column names and types, enhanced parsing consistency, and support for parallel parsing.
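
To make the suffixes concrete, here is a minimal Python sketch of what the -WithNames and -WithNamesAndTypes variants add to a delimited text format: one header row with column names, optionally followed by a second row with column types. This is an illustration of the layout only, not ClickHouse code. The function name render and its parameters are hypothetical, and the sketch assumes values contain no delimiters or newlines, so it performs none of the escaping that the real formats do.

```python
def render(rows, names, types, delimiter="\t", header="names_and_types"):
    """Render rows as delimited text with optional header rows.

    header is one of "none", "names", or "names_and_types".
    Assumption: values need no escaping (no delimiter or newline inside).
    """
    lines = []
    if header in ("names", "names_and_types"):
        # First header row: column names (the -WithNames part).
        lines.append(delimiter.join(names))
    if header == "names_and_types":
        # Second header row: column types (the -AndTypes part).
        lines.append(delimiter.join(types))
    for row in rows:
        lines.append(delimiter.join(str(value) for value in row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    data = [[1, "alice"], [2, "bob"]]
    print(render(data, ["id", "name"], ["UInt32", "String"]), end="")
```

Passing delimiter="," gives the comma-separated equivalent of the same layout, again without any quoting.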

Why it matters

These changes address inconsistencies and code duplication among several input/output formats (TSV, CSV, JSONCompactEachRow, RowBinary, etc.) by introducing a unified interface and extending support for column names and types. This increases format completeness, correctness, and usability, improves type checking, enables parallel parsing for better performance, and fixes various bugs related to NULL handling and string deserialization.

How to use it

To use the new functionality, select a format variant with the WithNames or WithNamesAndTypes suffix (e.g., CSVWithNamesAndTypes, TSVRawWithNames). The setting input_format_with_types_use_header controls whether the types in the header are checked against the table structure, and input_format_csv_empty_as_default controls whether empty CSV fields are treated as default values. Parallel parsing is supported for these formats, and settings such as min_chunk_bytes_for_parallel_parsing influence its behavior. The CustomSeparated format has matching variants, CustomSeparatedWithNames and CustomSeparatedWithNamesAndTypes.
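
The input-side behavior can be pictured with a small Python sketch. It reads text laid out like CSVWithNamesAndTypes (a names row, a types row, then data), maps fields to columns by header name, optionally checks the header types against the table, and optionally treats empty fields as defaults. It mimics the ideas behind input_format_with_types_use_header and input_format_csv_empty_as_default but is not ClickHouse's implementation. The table definition TABLE, the CASTS map, the sample data, and the function and parameter names are all hypothetical.

```python
import csv
import io

# Hypothetical target table: column name -> (type name, default value).
TABLE = {"id": ("UInt32", 0), "name": ("String", ""), "score": ("Float64", 0.0)}
# Hypothetical converters from text fields to Python values.
CASTS = {"UInt32": int, "String": str, "Float64": float}

# Sample input: columns in a different order than the table, one empty field,
# and the "score" column omitted entirely.
SAMPLE = "name,id\nString,UInt32\nalice,1\n,2\n"


def read_with_names_and_types(text, check_types=True, empty_as_default=True):
    """Parse names row + types row + data rows, mapping columns by name.

    Assumption: every header name exists in TABLE.
    """
    reader = csv.reader(io.StringIO(text))
    names = next(reader)
    types = next(reader)
    if check_types:
        # Analogous in spirit to input_format_with_types_use_header.
        for name, type_name in zip(names, types):
            if TABLE[name][0] != type_name:
                raise ValueError(f"type mismatch for column {name}: {type_name}")
    result = []
    for fields in reader:
        # Start from defaults so columns missing from the header are filled.
        row = {name: default for name, (_, default) in TABLE.items()}
        for name, field in zip(names, fields):
            if field == "" and empty_as_default:
                continue  # keep the column default for an empty field
            row[name] = CASTS[TABLE[name][0]](field)
        result.append(row)
    return result


if __name__ == "__main__":
    for parsed in read_with_names_and_types(SAMPLE):
        print(parsed)
```

Because columns are matched by name, the header order does not have to follow the table order, and columns absent from the header fall back to their defaults.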