v.23.4 Improvement
Improvements in Data Lakes: Iceberg, DeltaLake, and Hudi Enhancements
Several improvements around data lakes:
- Make Iceberg work with non-partitioned data.
- Support Iceberg format version v2 (previously only v1 was supported).
- Support reading partitioned data for DeltaLake/Hudi.
- Faster reading of DeltaLake metadata by using Delta's checkpoint files.
- Fixed incorrect Hudi reads: previously it incorrectly chose which data to read and therefore could read only small tables correctly.
- Made these engines pick up updates of changed data (previously the state was set on table creation).
- Added proper testing for Iceberg/DeltaLake/Hudi using Spark.
#47307 (Kseniia Sumarokova).
Why it matters
This change addresses several limitations in how ClickHouse integrates with popular data lake formats. Iceberg tables now work with non-partitioned data and with format version v2. DeltaLake and Hudi tables can read partitioned data, and DeltaLake metadata is read faster by using checkpoint files. Hudi reads, which were previously correct only for small tables, are fixed. All three engines now pick up changes to the underlying data instead of relying on the state captured at table creation. Together, these improvements make querying data lakes from ClickHouse more reliable, more efficient, and more current.
How to use it
To use these enhancements, create and query tables with the Iceberg, DeltaLake, or Hudi table engines as usual. Support for non-partitioned Iceberg data and for Iceberg format version v2 is automatic. The metadata reading optimization and the pickup of data updates are applied internally, and partitioned data is read correctly without additional configuration. The Spark-based tests improve validation of the engines and require no user action.
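As a rough illustration of "creating tables as usual", the sketch below builds `CREATE TABLE` statements for the three engines named above. The helper name `lake_table_ddl`, the table names, and the bucket URL are hypothetical and used only for illustration. The one-argument `ENGINE = Name('url')` form is an assumption: it presumes the storage location is readable without credentials passed in the statement, so check the engine documentation for your version before relying on it.

```python
SUPPORTED_ENGINES = {"Iceberg", "DeltaLake", "Hudi"}


def lake_table_ddl(name: str, engine: str, url: str) -> str:
    """Build a CREATE TABLE statement for a data lake table engine.

    Hypothetical helper: it only formats SQL text and does not talk to a server.
    """
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"unsupported engine: {engine}")
    # Escape backslashes and single quotes so the URL stays a valid SQL string literal.
    safe_url = url.replace("\\", "\\\\").replace("'", "\\'")
    return f"CREATE TABLE {name} ENGINE = {engine}('{safe_url}')"


if __name__ == "__main__":
    # Hypothetical bucket; replace with the location of your own lake tables.
    base = "https://example-bucket.s3.amazonaws.com/lake"
    for engine in ("Iceberg", "DeltaLake", "Hudi"):
        table = f"{engine.lower()}_events"
        print(lake_table_ddl(table, engine, f"{base}/{table}/"))
```

The generated statements can be run through any ClickHouse client. After that, an ordinary `SELECT` against the table should reflect the current state of the underlying data, since per the notes above the engines no longer depend on the state captured at creation.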