v.23.4 Improvement

Improvements in Data Lakes: Iceberg, DeltaLake, and Hudi Enhancements

Several improvements around data lakes:
- Make Iceberg work with non-partitioned data.
- Support Iceberg format version v2 (previously only v1 was supported).
- Support reading partitioned data for DeltaLake/Hudi.
- Faster reading of DeltaLake metadata by using Delta's checkpoint files.
- Fixed incorrect Hudi reads: previously it chose the wrong data to read and therefore could read only small tables correctly.
- Made these engines pick up updates of changed data (previously the state was set on table creation).
- Make proper testing for Iceberg/DeltaLake/Hudi using Spark.

#47307 (Kseniia Sumarokova).
Improved support for data lake formats including Iceberg, DeltaLake, and Hudi with enhancements in compatibility, performance, and correctness.

Why it matters

This change removes several limitations in how ClickHouse integrates with popular data lake formats. Iceberg now works with non-partitioned data and supports format version v2. DeltaLake and Hudi can read partitioned data, and DeltaLake metadata is read faster by using Delta's checkpoint files. Hudi reads, which previously were correct only for small tables, are fixed. All three engines now pick up changes to the underlying data instead of relying on state captured at table creation. Together, these changes make querying data lakes from ClickHouse more reliable, more efficient, and more current.

How to use it

No new syntax or settings are required: create and query tables with the Iceberg, DeltaLake, or Hudi table engines as usual. Support for non-partitioned Iceberg data and Iceberg format version v2 is automatic. The metadata reading optimization and the pickup of data updates are applied internally, and partitioned DeltaLake/Hudi data is read correctly without additional configuration. The Spark-based tests validate the engines and require no user action.
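
As a rough illustration of "as usual", the sketch below generates the kind of CREATE TABLE and SELECT statements one might run for each engine. It is a minimal sketch under assumptions, not a reference: the bucket URLs, table names, and credentials are hypothetical placeholders, and the helper create_table_sql is invented for this example. The engine argument order (url, then optional access key and secret) reflects the commonly documented S3 form of these engines; check the engine documentation for your ClickHouse version before relying on it.

```python
# Sketch: build CREATE TABLE statements for the three data lake engines.
# All URLs, table names, and credentials below are hypothetical placeholders.

SUPPORTED_ENGINES = ("Iceberg", "DeltaLake", "Hudi")


def quote(value: str) -> str:
    """Render a Python string as a single-quoted SQL string literal."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def create_table_sql(engine, table, url, access_key=None, secret_key=None):
    """Return a CREATE TABLE statement for a data lake table engine.

    Assumes the engine signature Engine(url[, access_key, secret_key]).
    No partitioning or format-version options are passed, because the
    release note describes those capabilities as automatic.
    """
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"unsupported engine: {engine}")
    args = [url]
    # Credentials are only added when both parts are supplied.
    if access_key is not None and secret_key is not None:
        args += [access_key, secret_key]
    arg_list = ", ".join(quote(a) for a in args)
    return f"CREATE TABLE {table} ENGINE = {engine}({arg_list})"


if __name__ == "__main__":
    base = "https://example-bucket.s3.amazonaws.com"  # hypothetical bucket
    for engine in SUPPORTED_ENGINES:
        table = f"lake_{engine.lower()}"
        url = f"{base}/{engine.lower()}_table/"
        print(create_table_sql(engine, table, url, "EXAMPLE_KEY", "EXAMPLE_SECRET") + ";")
        # Per the release note, reads should reflect the current state of the
        # underlying data rather than the state at table creation.
        print(f"SELECT count() FROM {table};")
```

The printed statements can be pasted into a ClickHouse client session. Nothing in them differs for partitioned versus non-partitioned data or for Iceberg v1 versus v2, which is the point of the change.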