v.22.1 New Feature

Added hdfsCluster Function for Parallel HDFS File Processing in Clusters

Published: November 25, 2025

Added the table function hdfsCluster, which allows processing files from HDFS in parallel on many nodes of a specified cluster, similar to s3Cluster. #32400 (Zhichang Yu).

Added the hdfsCluster table function to enable parallel processing of files stored in HDFS across multiple nodes within a specified cluster, similar to the existing s3Cluster function.

Why it matters

Queries over large datasets stored in HDFS can now use the nodes of a ClickHouse cluster in parallel instead of being processed by a single server. This improves performance and scalability when querying HDFS data in a multi-node environment.

How to use it

Call the hdfsCluster table function in the FROM clause of a SQL query to process HDFS files in parallel across cluster nodes. As with s3Cluster, the first argument names the ClickHouse cluster whose nodes do the processing (not the location of the files); the remaining arguments give the HDFS URI of the files, the data format, and the table structure.
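
A minimal sketch of such a query, assuming the argument order hdfsCluster(cluster_name, URI, format, structure). The cluster name, NameNode address, path, and column list below are hypothetical placeholders, not values taken from this release; substitute your own.

```sql
-- Hypothetical names: 'my_cluster' must be a cluster defined in the server's
-- remote_servers configuration, and the URI must point at your own NameNode.
SELECT count(*)
FROM hdfsCluster(
    'my_cluster',                          -- ClickHouse cluster that runs the query
    'hdfs://namenode:9000/data/events/*',  -- HDFS files to read; globs such as * are allowed
    'TSV',                                 -- format of the files
    'name String, value UInt32'            -- assumed column structure of the files
);
```

Each file matched by the glob is handled by one of the cluster's nodes, so a path that matches many files gives the cluster more work to distribute.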

Related resources

Pull Request #32400