v.23.9New Feature

Improve JSON Schema Inference and Type Handling in ClickHouse

Improve schema inference from JSON formats: 1) Now it's possible to infer named Tuples from JSON objects without experimental JSON type under a setting input_format_json_try_infer_named_tuples_from_objects in JSON formats. Previously without experimental type JSON we could only infer JSON objects as Strings or Maps, now we can infer named Tuple. Resulting Tuple type will conain all keys of objects that were read in data sample during schema inference. It can be useful for reading structured JSON data without sparse objects. The setting is enabled by default. 2) Allow parsing JSON array into a column with type String under setting input_format_json_read_arrays_as_strings. It can help reading arrays with values with different types. 3) Allow to use type String for JSON keys with unkown types (null/[]/{}) in sample data under setting input_format_json_infer_incomplete_types_as_strings. Now in JSON formats we can read any value into String column and we can avoid getting error Cannot determine type for column 'column_name' by first 25000 rows of data, most likely this column contains only Nulls or empty Arrays/Maps during schema inference by using type String for unknown types, so the data will be read successfully. #54427 (Kruglov Pavel).
Enhanced JSON schema inference in ClickHouse with improved support for named Tuples, arrays as strings, and handling of incomplete JSON types.

Why it matters

This feature improves the schema inference process for JSON formats by enabling more accurate and flexible type detection. It allows named Tuples to be inferred from JSON objects without relying on experimental JSON types, helps with reading JSON arrays as strings to accommodate heterogeneous array values, and permits treating unknown or incomplete JSON types as strings. These enhancements reduce schema inference errors and improve compatibility with diverse JSON data structures.

How to use it

The feature is controlled via three new settings in JSON input formats:

- Enable named Tuple inference from JSON objects with input_format_json_try_infer_named_tuples_from_objects (enabled by default).
- Read JSON arrays as strings by enabling input_format_json_read_arrays_as_strings.
- Treat incomplete or unknown JSON types as strings using input_format_json_infer_incomplete_types_as_strings.

Users can configure these settings when reading JSON data to improve schema inference and avoid errors related to null or heterogeneous JSON values.