v.22.4 Improvement
Multiple Enhancements to Schema Inference for Various Data Formats
Multiple improvements for schema inference.
- Use some tweaks and heuristics to determine numbers, strings, arrays, tuples and maps in CSV, TSV and TSVRaw data formats. Add setting input_format_csv_use_best_effort_in_schema_inference for CSV format that enables or disables these heuristics; if it is disabled, everything is treated as a string. Add similar setting input_format_tsv_use_best_effort_in_schema_inference for TSV/TSVRaw format. These settings are enabled by default.
- Add Maps support for schema inference in Values format.
- Fix possible segfault in schema inference in Values format.
- Allow skipping columns with unsupported types in Arrow/ORC/Parquet formats. Add corresponding settings for it: input_format_{parquet|orc|arrow}_skip_columns_with_unsupported_types_in_schema_inference. These settings are disabled by default.
- Allow converting a column with type Null to a Nullable column with all NULL values in Arrow/Parquet formats.
- Allow specifying column names in schema inference via setting column_names_for_schema_inference for formats that don't contain column names (like CSV, TSV, JSONCompactEachRow, etc).
- Fix schema inference in ORC/Arrow/Parquet formats in terms of working with Nullable columns. Previously all inferred types were not Nullable, which blocked reading Nullable columns from data. Now all inferred types are always Nullable, because the schema alone does not tell us whether a column is Nullable or not.
- Fix schema inference in Template format with CSV escaping rules.
#35582 (Kruglov Pavel).
Why it matters
This change makes schema inference in ClickHouse more accurate and flexible: heuristics now detect numbers, strings, arrays, tuples and maps in CSV, TSV and TSVRaw data, and types inferred from ORC, Arrow and Parquet are always Nullable. It also lets inference skip columns with unsupported types in Arrow, ORC and Parquet, fixes a possible segfault in the Values format, and gives users control over inference behavior through new settings, which makes data ingestion more reliable.
How to use it
Users can enable or disable enhanced schema inference heuristics for CSV, TSV, and TSVRaw formats with the settings input_format_csv_use_best_effort_in_schema_inference and input_format_tsv_use_best_effort_in_schema_inference (enabled by default). To skip columns with unsupported types during schema inference in Arrow, ORC, and Parquet formats, users can toggle input_format_parquet_skip_columns_with_unsupported_types_in_schema_inference, input_format_orc_skip_columns_with_unsupported_types_in_schema_inference, and input_format_arrow_skip_columns_with_unsupported_types_in_schema_inference (disabled by default). Additionally, specifying column names in formats lacking them is possible via the column_names_for_schema_inference setting.
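The settings above can be tried out in a session roughly as follows. This is a minimal sketch, not an excerpt from the changelog: the file names events.csv and events.parquet and the column names id, name and tags are hypothetical, and it assumes the files are readable through the file table function. DESCRIBE is used only to show the inferred schema, so no expected output is given here.

```sql
-- Hypothetical files: events.csv (no header row) and events.parquet.

-- 1. CSV with best-effort heuristics (already the default; set explicitly for clarity).
SET input_format_csv_use_best_effort_in_schema_inference = 1;
DESCRIBE file('events.csv', 'CSV');

-- 2. Same file with heuristics disabled: every column is treated as a string.
SET input_format_csv_use_best_effort_in_schema_inference = 0;
DESCRIBE file('events.csv', 'CSV');

-- 3. Supply column names for a format that does not carry them.
--    The value is a comma-separated list; these names are made up for illustration.
SET column_names_for_schema_inference = 'id,name,tags';
DESCRIBE file('events.csv', 'CSV');

-- 4. Parquet: skip columns whose types cannot be mapped instead of failing inference.
SET input_format_parquet_skip_columns_with_unsupported_types_in_schema_inference = 1;
DESCRIBE file('events.parquet', 'Parquet');
```

Comparing the output of steps 1 and 2 on the same file is a quick way to see what the heuristics contribute before relying on them for ingestion.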