v.21.12 New Features
Major Enhancements to Data Format Completeness and Consistency
TLDR: Major improvements to the completeness and consistency of text formats. Refactor formats TSV, TSVRaw, CSV and JSONCompactEachRow, JSONCompactStringsEachRow, remove code duplication, and add a base interface for formats with -WithNames and -WithNamesAndTypes suffixes. Add formats CSVWithNamesAndTypes, TSVRawWithNames, TSVRawWithNamesAndTypes, JSONCompactEachRowWithNames, JSONCompactStringsEachRowWithNames, RowBinaryWithNames. Support parallel parsing for formats TSVWithNamesAndTypes, TSVRaw(WithNames/WithNamesAndTypes), CSVWithNamesAndTypes, JSONCompactEachRow(WithNames/WithNamesAndTypes), JSONCompactStringsEachRow(WithNames/WithNamesAndTypes). Support column mapping and type checking for the RowBinaryWithNamesAndTypes format. Add setting input_format_with_types_use_header, which specifies whether to check that the types written in a <format_name>WithNamesAndTypes format match the table structure. Add setting input_format_csv_empty_as_default and use it in the CSV format instead of input_format_defaults_for_omitted_fields (because that setting should not control csv_empty_as_default). Fix usage of setting input_format_defaults_for_omitted_fields (it was used only as csv_empty_as_default, but it should control calculation of default expressions for omitted fields). Fix Nullable input/output in the TSVRaw format and make this format fully compatible with inserting into TSV. Fix inserting NULLs in LowCardinality(Nullable) when input_format_null_as_default is enabled (previously default values were inserted instead of actual NULLs). Fix string deserialization in JSONStringsEachRow/JSONCompactStringsEachRow formats (strings were parsed only until the first '\n' or '\t'). Add the ability to use the Raw escaping rule in the Template input format. Add diagnostic info for the JSONCompactEachRow(WithNames/WithNamesAndTypes) input format. Fix a bug with parallel parsing of -WithNames formats when setting min_chunk_bytes_for_parallel_parsing is less than the number of bytes in a single row. #30178 (Kruglov Pavel).
Allow printing/parsing names and types of columns in the CustomSeparated input/output format. Add formats CustomSeparatedWithNames/WithNamesAndTypes, similar to TSVWithNames/WithNamesAndTypes. #31434 (Kruglov Pavel).
Why it matters
These changes address inconsistencies and code duplication among several input/output formats (TSV, CSV, JSONCompactEachRow, RowBinary, etc.) by introducing a unified interface and extending support for column names and types. This increases format completeness, correctness, and usability, improves type checking, enables parallel parsing for better performance, and fixes various bugs related to NULL handling and string deserialization.
How to use it
To use the new functionality, select one of the format variants with the WithNames or WithNamesAndTypes suffix (e.g., CSVWithNamesAndTypes, TSVRawWithNames). The setting input_format_with_types_use_header controls whether the types in the header are checked against the table structure, while input_format_csv_empty_as_default controls whether empty CSV fields are read as default values. Parallel parsing is supported for the formats listed in the TLDR without extra configuration, and settings such as min_chunk_bytes_for_parallel_parsing influence its behavior. The CustomSeparated format now has WithNames and WithNamesAndTypes variants as well.
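To make the header layout concrete, here is a minimal Python sketch of what a CSVWithNamesAndTypes payload looks like: one row of column names, one row of column types, then the data rows. It also includes a client-side check loosely analogous to what input_format_with_types_use_header asks the server to do. The helper names to_csv_with_names_and_types and header_type_mismatches, the sample columns, and the sample rows are hypothetical and are not part of ClickHouse; the server's actual validation logic is not shown in these notes and may differ.

```python
import csv
import io


def to_csv_with_names_and_types(columns, rows):
    """Serialize rows with a names row and a types row in front.

    columns: list of (name, type) pairs, e.g. [("id", "UInt32")].
    rows: iterable of tuples with one value per column.
    """
    buf = io.StringIO()
    # Quote strings, leave numbers bare; use "\n" line endings.
    writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerow([name for name, _ in columns])    # header row 1: names
    writer.writerow([type_ for _, type_ in columns])  # header row 2: types
    writer.writerows(rows)
    return buf.getvalue()


def header_type_mismatches(payload, expected_types):
    """Compare the types row against an expected {name: type} mapping.

    Returns a list of (name, type_in_header, expected_type) tuples.
    This is an illustration only, not ClickHouse's implementation.
    """
    reader = csv.reader(io.StringIO(payload))
    names = next(reader)
    types = next(reader)
    return [
        (name, type_, expected_types[name])
        for name, type_ in zip(names, types)
        if name in expected_types and expected_types[name] != type_
    ]


if __name__ == "__main__":
    # Hypothetical two-column table used only for illustration.
    payload = to_csv_with_names_and_types(
        [("id", "UInt32"), ("name", "String")],
        [(1, "alice"), (2, "bob")],
    )
    print(payload)
    print(header_type_mismatches(payload, {"id": "UInt64", "name": "String"}))
```

A payload built this way could then be sent with an INSERT that names the CSVWithNamesAndTypes format; because the header carries column names, the column order in the file does not have to match the table's column order.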