v22.6 Improvement
Apply input_format_max_rows_to_read_for_schema_inference Globally and Increase Default to 25000
Apply the setting input_format_max_rows_to_read_for_schema_inference to the total number of rows read across all files matched by a glob. Previously, the setting was applied to each file separately, so with a huge number of NULLs we could read the first input_format_max_rows_to_read_for_schema_inference rows from each file and still infer nothing. The default value for this setting has also been increased to 25000. #37332 (Kruglov Pavel).
Why it matters
Previously, when reading multiple files using globs, schema inference read up to input_format_max_rows_to_read_for_schema_inference rows from each file separately. If many files contained mostly NULL values, this could read a large amount of data and still fail to deduce the schema. The change applies the row limit cumulatively across all matched files, improving both the accuracy and the efficiency of schema inference by reducing redundant reads and the impact of NULL-heavy files.
How to use it
No changes to existing queries are required, as the setting's behavior is updated internally. The maximum number of rows used for schema inference when reading input formats can still be controlled with the setting input_format_max_rows_to_read_for_schema_inference. Its default value has been increased to 25000 to provide better schema inference out of the box. For example:
SET input_format_max_rows_to_read_for_schema_inference = 25000;
SELECT * FROM file('data_*.csv', 'CSV');
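As a sketch of the new cumulative behavior, suppose a glob matches several CSV files where the first files contain only empty fields; inference now keeps consuming rows from later files until the total limit is reached, instead of giving up per file. The file names below are hypothetical:

```sql
-- Hypothetical files matched by the glob: inputs_1.csv and
-- inputs_2.csv contain only empty (NULL) fields, while
-- inputs_3.csv contains real values. With the cumulative limit,
-- inference is no longer capped per file on NULL-only data; it
-- continues reading into inputs_3.csv and can deduce real types.
DESCRIBE file('inputs_*.csv', 'CSV');
```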