v.23.12 Improvement

Fix Issues with Distributed Sends and Track Async INSERT Failures in ClickHouse

Fix possible distributed sends stuck due to "No such file or directory" (during recovering a batch from disk). Fix possible issues with error_count from system.distribution_queue (in case of distributed_directory_monitor_max_sleep_time_ms >5min). Introduce profile event to track async INSERT failures - DistributedAsyncInsertionFailures. #57480 (Azat Khuzhin).
Fixes distributed sends that could get stuck on a "No such file or directory" error while recovering a batch from disk, and corrects error_count in system.distribution_queue when distributed_directory_monitor_max_sleep_time_ms is set above 5 minutes. Also adds the profile event DistributedAsyncInsertionFailures, which tracks failed asynchronous INSERTs into distributed tables.

Why it matters

This change makes distributed inserts more stable: a failed batch recovery no longer leaves sends stuck. It also makes error_count in system.distribution_queue more accurate when the maximum sleep time exceeds 5 minutes, and it exposes asynchronous insertion failures as a profile event, which helps users diagnose and resolve problems in distributed data ingestion.

How to use it

The fixes to batch recovery and error counting apply automatically and need no configuration. To monitor async INSERT failures, track the new profile event DistributedAsyncInsertionFailures through any interface that exposes profile events, such as the system.events table, or through your monitoring tools.
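
As a minimal sketch of that monitoring, the two queries below read the new counter and the per-table send queue. They assume the standard system.events and system.distribution_queue tables; the selected columns and the error_count > 0 filter are illustrative choices, not something this changelog entry prescribes.

```sql
-- Cumulative number of failed async INSERTs into distributed tables
-- since the server started. system.events typically lists only events
-- that have fired, so an empty result usually means no failures so far.
SELECT event, value, description
FROM system.events
WHERE event = 'DistributedAsyncInsertionFailures';

-- Distributed tables whose background sends are currently reporting errors.
SELECT database, table, error_count, data_files, last_exception
FROM system.distribution_queue
WHERE error_count > 0;
```

Because the counter is cumulative, an alert would normally watch how much it grows between two samples rather than its absolute value.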