v22.6 Improvement

Apply input_format_max_rows_to_read_for_schema_inference Globally and Increase Default to 25000

Apply the setting input_format_max_rows_to_read_for_schema_inference to the total number of rows read across all files matched by a glob. Previously the setting was applied to each file in the glob separately, so with a huge number of nulls we could read the first input_format_max_rows_to_read_for_schema_inference rows from every file and still infer nothing. Also increase the default value for this setting to 25000. #37332 (Kruglov Pavel).
The input_format_max_rows_to_read_for_schema_inference setting now limits the total number of rows read from all files combined when using globs, rather than applying the limit separately to each file.

Why it matters

Previously, when reading multiple files matched by a glob, schema inference read up to input_format_max_rows_to_read_for_schema_inference rows from each file separately. If many files started with mostly null values, this could read a large amount of data and still fail to deduce the schema. Applying the row limit cumulatively across all files improves both the efficiency of schema inference (no redundant per-file reads) and its accuracy, since the row budget is not exhausted on repeated runs of nulls.
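The difference in how many rows get read can be sketched with a toy model. This is a hypothetical simplification for illustration only, not ClickHouse internals; the function names are invented for this sketch:

```python
def rows_read_per_file(files, max_rows):
    """Old behavior: each file contributes up to max_rows rows,
    so the total read can be as large as max_rows * len(files)."""
    return sum(min(len(f), max_rows) for f in files)

def rows_read_cumulative(files, max_rows):
    """New behavior: max_rows is a single budget shared across
    all files, so the total read never exceeds max_rows."""
    total = 0
    for f in files:
        if total >= max_rows:
            break
        total += min(len(f), max_rows - total)
    return total

# Three globbed files of 100 rows each, limit of 150:
files = [[None] * 100, [None] * 100, [None] * 100]
print(rows_read_per_file(files, 150))    # 300 rows read in total
print(rows_read_cumulative(files, 150))  # capped at 150 rows
```

With the old per-file behavior the work grows with the number of files; with the cumulative limit the cost of schema inference is bounded regardless of how many files the glob matches.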

How to use it

No changes are required on the user side; the behavior of the existing setting was updated internally. You can still control the maximum number of rows used for schema inference with input_format_max_rows_to_read_for_schema_inference, whose default has been raised to 25000 to provide better schema inference out of the box. For example:

SET input_format_max_rows_to_read_for_schema_inference = 25000;
SELECT * FROM file('data_*.csv', 'CSV');
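To check the value in effect for your session, you can query the standard system.settings table:

```sql
SELECT name, value, changed
FROM system.settings
WHERE name = 'input_format_max_rows_to_read_for_schema_inference';
```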