v.22.4 Improvement

Multiple Enhancements to Schema Inference for Various Data Formats

Published: November 25, 2025

Multiple improvements for schema inference:

- Use heuristics to detect numbers, strings, arrays, tuples and maps in the CSV, TSV and TSVRaw data formats. Add the setting input_format_csv_use_best_effort_in_schema_inference for the CSV format, which enables or disables these heuristics; when it is disabled, every column is treated as a string. Add the similar setting input_format_tsv_use_best_effort_in_schema_inference for the TSV/TSVRaw formats. These settings are enabled by default.
- Add Map support to schema inference in the Values format.
- Fix a possible segfault in schema inference in the Values format.
- Allow skipping columns with unsupported types in the Arrow/ORC/Parquet formats. Add the corresponding settings: input_format_{parquet|orc|arrow}_skip_columns_with_unsupported_types_in_schema_inference. These settings are disabled by default.
- Allow converting a column of type Null to a Nullable column with all NULL values in the Arrow/Parquet formats.
- Allow specifying column names for schema inference via the setting column_names_for_schema_inference for formats that don't contain column names (such as CSV, TSV, JSONCompactEachRow, etc.).
- Fix schema inference in the ORC/Arrow/Parquet formats when working with Nullable columns. Previously, no inferred type was Nullable, which blocked reading Nullable columns from the data. Now all inferred types are always Nullable, because the schema alone does not tell us whether a column is Nullable.
- Fix schema inference in the Template format with CSV escaping rules.

#35582 (Kruglov Pavel).
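To illustrate the Nullable change for ORC/Arrow/Parquet, the inferred schema of a file can be inspected with DESCRIBE. This is a minimal sketch: the file name events.parquet is hypothetical, and it assumes the file is readable by the file() table function (for example from the server's user_files directory or via clickhouse-local).

```sql
-- Hypothetical file name; replace with a Parquet file you actually have.
-- Per the note above, every inferred column type should be reported
-- wrapped in Nullable(...), regardless of whether the data contains NULLs.
DESCRIBE TABLE file('events.parquet', 'Parquet');
```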

Improved schema inference for multiple data formats including CSV, TSV, TSVRaw, Values, Arrow, ORC, and Parquet with enhanced heuristics, support for new types, and added configuration options.

Why it matters

This change improves the accuracy and flexibility of schema inference in ClickHouse by using heuristics to better detect data types such as numbers, strings, arrays, tuples and maps, and by inferring Nullable types correctly in ORC, Arrow and Parquet. It lets users skip columns with unsupported types instead of failing on them, improves stability by fixing a possible segmentation fault in the Values format, and gives users control over schema inference behavior through new settings, which makes data ingestion more reliable.

How to use it

Users can enable or disable enhanced schema inference heuristics for CSV, TSV, and TSVRaw formats with the settings input_format_csv_use_best_effort_in_schema_inference and input_format_tsv_use_best_effort_in_schema_inference (enabled by default). To skip columns with unsupported types during schema inference in Arrow, ORC, and Parquet formats, users can toggle input_format_parquet_skip_columns_with_unsupported_types_in_schema_inference, input_format_orc_skip_columns_with_unsupported_types_in_schema_inference, and input_format_arrow_skip_columns_with_unsupported_types_in_schema_inference (disabled by default). Additionally, specifying column names in formats lacking them is possible via the column_names_for_schema_inference setting.
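The queries below sketch how these settings might be applied per query. The file names (data.csv, data.parquet) and the column names id, name and tags are hypothetical and chosen only for illustration; the setting names are the ones listed above.

```sql
-- Default behavior: best-effort heuristics are enabled, so CSV fields
-- that look like numbers, arrays, etc. are inferred as such.
DESCRIBE TABLE file('data.csv', 'CSV');

-- Disable the heuristics: every CSV column is treated as a string.
DESCRIBE TABLE file('data.csv', 'CSV')
SETTINGS input_format_csv_use_best_effort_in_schema_inference = 0;

-- Supply column names for a format that does not carry them.
-- The value is a comma-separated list; these names are assumptions.
DESCRIBE TABLE file('data.csv', 'CSV')
SETTINGS column_names_for_schema_inference = 'id,name,tags';

-- Skip Parquet columns whose types are unsupported during schema
-- inference (the setting is disabled by default).
SELECT *
FROM file('data.parquet', 'Parquet')
SETTINGS input_format_parquet_skip_columns_with_unsupported_types_in_schema_inference = 1;
```

The same settings can also be set for the whole session with SET instead of a per-query SETTINGS clause.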

Related resources

Pull Request #35582