ClickHouse Dictionaries: The complete production guide

ClickHouse Dictionaries are specialized key-value data structures that replace expensive JOIN operations with sub-millisecond lookups, delivering up to 25x faster query performance while eliminating the ETL pipelines otherwise needed to sync external reference data. Dictionaries store key → attribute mappings in memory-optimized layouts, enabling direct data enrichment from MySQL, PostgreSQL, Redis, HTTP APIs, and other sources without intermediate staging.

This capability transforms how analytical workloads handle dimensional data. Rather than building hash tables at query time, dictionaries pre-load reference data into optimized structures that serve lookups in O(1) time complexity. The feature was originally developed as a "silver bullet" for star-schema analytics before ClickHouse supported multi-table JOINs effectively, and remains the preferred approach for low-latency data enrichment at scale.


Why dictionaries exist and what problems they solve

Traditional JOINs in analytical databases require building hash tables at query execution time, scanning both tables, and matching rowsβ€”expensive operations that compound with query frequency. Dictionaries address this by pre-computing and caching reference data lookups that queries access repeatedly.

The core problems dictionaries solve:

  • JOIN performance: repeated hash table construction at query time wastes compute
  • ETL elimination: external data loads directly from MySQL/PostgreSQL/APIs without staging pipelines
  • Real-time enrichment: data is enriched during INSERT operations without slowing ingestion
  • Cross-database integration: native connectors reach external systems

Performance benchmarks demonstrate the magnitude of improvement. Testing 1 million actors joined with 100 million roles showed Hash JOIN completing in 1.084 seconds using 4.44 GiB memory, while Direct JOIN with a flat dictionary completed in 0.044 secondsβ€”a 25x speedup with significantly lower memory overhead.


Dictionary architecture and internal design

Dictionaries follow a three-component architecture: SOURCE (where data originates), LOADER (refresh mechanism controlled by LIFETIME), and LAYOUT (the in-memory data structure). When queries call dictGet() functions, they access the pre-loaded layout directly without network calls or disk I/O.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  SOURCE           β†’    LOADER         β†’    LAYOUT      β”‚
β”‚  (MySQL, HTTP,         (Refresh            (flat,      β”‚
β”‚   ClickHouse,          mechanism,           hashed,    β”‚
β”‚   File, Redis)         LIFETIME)            cache...)  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              ↓
        Query Access via dictGet('dict_name', 'attr', key)

Unlike regular tables that support full CRUD operations with disk-based storage, dictionaries are read-only structures loaded from sources with unique keys (last value wins on duplicates). Access happens through specialized dictGet functions rather than SELECT statements, and updates occur through periodic refreshes rather than direct INSERT/UPDATE operations.


Complete dictionary types reference

FLAT layout: Maximum speed for sequential keys

The FLAT layout stores data as direct arrays indexed by key value, delivering the fastest possible lookups through simple array offset accessβ€”no hash computation required.

Internal mechanics: Key value N maps directly to position N in attribute arrays. Memory consumption equals max_key_value Γ— (sizeof(attributes) + overhead), meaning a dictionary with max key 100,000 allocates 100,000 array slots regardless of actual entry count.

Performance characteristics: 4-5x faster than hashed layouts with O(1) lookups. Default maximum key value is 500,000 (configurable via max_array_size). Only supports UInt64 keys.

When to use FLAT: Auto-increment IDs, sequential keys, dense key distributions under 500K entries, maximum performance requirements.

When to avoid FLAT: Keys exceeding 500K, sparse distributions (keys like 1, 1000000, 2000000 waste massive memory), string keys, composite keys.
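
To see how sparse keys waste memory, here is a quick back-of-the-envelope calculation using the simplified formula above (illustrative 8-byte attribute slots; per-array overhead ignored):

```python
# FLAT layout allocation is driven by the maximum key value, not by the
# number of entries actually stored. Sizes here are illustrative.
def flat_memory_bytes(max_key_value: int, attr_bytes_per_entry: int = 8) -> int:
    return max_key_value * attr_bytes_per_entry

dense = flat_memory_bytes(100_000)     # 100K dense sequential keys
sparse = flat_memory_bytes(2_000_000)  # only 3 keys: 1, 1000000, 2000000

print(dense)   # 800000 bytes, all slots useful
print(sparse)  # 16000000 bytes allocated for just 3 entries
```

The sparse case allocates 20x more memory for 3 rows than the dense case does for 100,000, which is why sparse key spaces belong in HASHED layouts instead.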

CREATE DICTIONARY users_flat_dict (
    user_id UInt64,
    username String,
    email String
)
PRIMARY KEY user_id
SOURCE(CLICKHOUSE(DB 'default' TABLE 'users'))
LAYOUT(FLAT(
    INITIAL_ARRAY_SIZE 1024
    MAX_ARRAY_SIZE 500000
))
LIFETIME(MIN 300 MAX 360);

-- Usage: O(1) direct array access
SELECT dictGet('users_flat_dict', 'username', toUInt64(1));

HASHED layout: General-purpose key-value storage

The HASHED layout uses hash tables for arbitrary key distributions, supporting millions of entries without the sequential key constraint.

Internal mechanics: ClickHouse's optimized HashMap implementation hashes keys to buckets with collision resolution. Separate hash tables are created for each attribute, which increases memory but enables parallel attribute access.

Performance characteristics: ~0.55 seconds per 10 million lookups (vs ~0.044s for FLAT). Memory overhead approximately 30-50% beyond raw data size. Parallel loading available via SHARDS parameter.

Memory comparison for 5 million entries with UInt64 key + Float64 attribute:

  • HASHED: 256 MiB
  • SPARSE_HASHED: 76 MiB (3x savings, 3x slower)
  • HASHED_ARRAY: 69 MiB (most efficient for multiple attributes)

CREATE DICTIONARY products_hashed_dict (
    product_id UInt64,
    name String DEFAULT 'Unknown',
    price Float64 DEFAULT 0.0
)
PRIMARY KEY product_id
SOURCE(CLICKHOUSE(TABLE 'products'))
LAYOUT(HASHED(
    SHARDS 10                    -- Parallel loading
    SHARD_LOAD_QUEUE_BACKLOG 10000
    MAX_LOAD_FACTOR 0.5          -- Memory/speed tradeoff
))
LIFETIME(MIN 300 MAX 360);

SPARSE_HASHED variant: Trades CPU for memory, using approximately 3x less memory but with 3x slower lookups. Use for memory-constrained environments with single-attribute dictionaries and lower query frequency.

HASHED_ARRAY variant: Single hash table stores indices pointing to separate attribute arrays. Most memory-efficient for dictionaries with 3+ attributes while maintaining near-HASHED query speed.
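
A HASHED_ARRAY definition differs from the HASHED example only in the LAYOUT clause; table and column names below are illustrative:

```sql
CREATE DICTIONARY products_hashed_array_dict (
    product_id UInt64,
    name String DEFAULT 'Unknown',
    category String DEFAULT 'Unknown',
    price Float64 DEFAULT 0.0
)
PRIMARY KEY product_id
SOURCE(CLICKHOUSE(TABLE 'products'))
LAYOUT(HASHED_ARRAY())  -- one hash table of indices, attributes in plain arrays
LIFETIME(MIN 300 MAX 360);
```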


COMPLEX_KEY_HASHED: Composite and string key support

For non-UInt64 keysβ€”strings, composite multi-column keys, or natural business keysβ€”COMPLEX_KEY_HASHED provides hash table storage with tuple-based key hashing.

Key composition rules: Primary key can include multiple columns of different types. All key components must be provided in lookups using tuple() syntax.

CREATE DICTIONARY pricing_dict (
    product_id UInt64,
    region String,
    customer_tier String,
    price Float64 DEFAULT 0.0,
    discount Float64 DEFAULT 0.0
)
PRIMARY KEY product_id, region, customer_tier
SOURCE(CLICKHOUSE(TABLE 'pricing'))
LAYOUT(COMPLEX_KEY_HASHED(SHARDS 5))
LIFETIME(MIN 300 MAX 360);

-- Lookup requires tuple for composite keys
SELECT dictGet('pricing_dict', 'price', tuple(toUInt64(12345), 'US', 'premium'));

-- String-only key example
CREATE DICTIONARY airline_dict (
    code String,
    company String
)
PRIMARY KEY code
SOURCE(CLICKHOUSE(TABLE 'airlines'))  -- source table name illustrative
LAYOUT(COMPLEX_KEY_HASHED())
LIFETIME(MIN 300 MAX 360);

SELECT dictGet('airline_dict', 'company', tuple('UA'));

RANGE_HASHED: Time-varying and versioned data

RANGE_HASHED associates attribute values with date or numeric ranges, enabling point-in-time lookups for historical pricing, tax rates, exchange rates, or SCD Type 2 slowly changing dimensions.

Internal mechanics: Hash table lookup by primary key, then binary search through ordered range intervals to find the matching time period. Uses interval tree structure for efficient range queries when many intervals exist per key.

Range definition: RANGE(MIN start_column MAX end_column) specifies inclusive bounds. Supported types include Date, DateTime, DateTime64, and numeric types. Use NULL for open-ended ranges (converts to epoch or MAX_INT).

Overlapping range handling: range_lookup_strategy parameter controls behaviorβ€”min returns range with smallest range_min (default), max returns range with largest range_min.

CREATE DICTIONARY discount_dict (
    advertiser_id UInt64,
    start_date Date,
    end_date Date,
    discount_rate Float64 DEFAULT 0.0
)
PRIMARY KEY advertiser_id
SOURCE(CLICKHOUSE(TABLE 'discount_history'))
LIFETIME(MIN 300 MAX 360)
LAYOUT(RANGE_HASHED(RANGE_LOOKUP_STRATEGY 'min'))
RANGE(MIN start_date MAX end_date);

-- Point-in-time lookup with key + date
SELECT dictGet('discount_dict', 'discount_rate', 
               toUInt64(123), 
               toDate('2024-01-15'));

-- Open-ended ranges (NULL in source becomes epoch/MAX)
INSERT INTO discount_history VALUES 
    (123, NULL, '2024-03-13', 0.10),      -- Open start
    (123, '2024-03-14', NULL, 0.15);       -- Open end

CACHE layout: On-demand loading for massive datasets

The CACHE layout stores a fixed-size LRU cache rather than pre-loading all data, fetching missing keys from the source on demand. Ideal when reference data is too large for memory but access patterns concentrate on a hot subset.

Internal mechanics: Cache hits return immediately; cache misses trigger SELECT ... WHERE id IN (k1, k2, ...) to the source. Eviction follows LRU policy. Size specified in cells (rounded to power of two).

Critical requirement: Cache dictionaries only perform well with hit rates β‰₯99%. Random access patterns across the full key space result in constant source queries and poor performance.

CREATE DICTIONARY user_cache_dict (
    user_id UInt64,
    username String,
    email String
)
PRIMARY KEY user_id
SOURCE(MYSQL(
    HOST 'mysql-server' PORT 3306
    USER 'reader' PASSWORD 'secret'
    DB 'users' TABLE 'user_profiles'
))
LAYOUT(CACHE(SIZE_IN_CELLS 1000000))
LIFETIME(MIN 300 MAX 360);
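
Whether a cache dictionary actually clears the ≥99% bar can be checked from system.dictionaries (dictionary name taken from the example above):

```sql
-- A hit_rate persistently below ~0.99 means the cache is thrashing
-- against the source and a pre-loaded layout may be a better fit
SELECT name, hit_rate, element_count,
       formatReadableSize(bytes_allocated) AS memory
FROM system.dictionaries
WHERE name = 'user_cache_dict';
```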

SSD_CACHE variant: Stores data on local SSD with only the index in RAM, supporting much larger dictionaries than memory-only CACHE. Requires local SSD storage and accepts higher I/O latency.

LAYOUT(SSD_CACHE(
    BLOCK_SIZE 4096
    FILE_SIZE 16777216
    PATH '/var/lib/clickhouse/user_files/ssd_dict'
))

DIRECT layout: Real-time source queries

DIRECT dictionaries store nothing locally—every dictGet() call queries the source directly. The memory footprint is zero, but lookup latency depends entirely on source response time.

When appropriate: Data changes too frequently for caching, source is extremely fast (in-memory database, local storage), lookup frequency is low, or values needed only rarely.

CREATE DICTIONARY realtime_prices_dict (
    product_id UInt64,
    current_price Decimal(10,2),
    stock_count UInt32
)
PRIMARY KEY product_id
SOURCE(POSTGRESQL(
    HOST 'postgres-server' PORT 5432
    USER 'readonly' PASSWORD 'secret'
    DB 'inventory' TABLE 'product_prices'
))
LAYOUT(DIRECT())
LIFETIME(0);  -- No refresh needed

IP_TRIE layout: Network address lookups

IP_TRIE uses a bitwise trie (prefix tree) optimized for IP address lookups in CIDR notation, supporting both IPv4 and IPv6 with automatic longest-prefix matching.

Internal mechanics: Trie nodes represent individual bits of IP addresses. Lookup time bounded by key length (32 bits for IPv4, 128 bits for IPv6). Overlapping prefixes automatically return the most specific match.

Performance: ~3 million lookups/second per core for IPv6 with ~400K entries, with roughly 50 MB/sec of throughput on a single core.

CREATE DICTIONARY geoip_trie (
    cidr String,                    -- Key in CIDR notation
    country_code String,
    city String,
    latitude Float64,
    longitude Float64
)
PRIMARY KEY cidr
SOURCE(CLICKHOUSE(TABLE 'geoip_data'))
LAYOUT(IP_TRIE())
LIFETIME(3600);

-- IPv4 lookup
SELECT dictGet('geoip_trie', 'country_code', toIPv4('8.8.8.8'));

-- IPv6 lookup
SELECT dictGet('geoip_trie', 'city', IPv6StringToNum('2620:0:870::1'));

-- Real-world usage with access logs
SELECT 
    dictGet('geoip_trie', 'country_code', tuple(client_ip)) AS country,
    count() AS requests
FROM access_logs
WHERE date = today()
GROUP BY country;

POLYGON layout: Point-in-polygon geographic queries

POLYGON dictionaries enable reverse geocodingβ€”given coordinates (x, y), find which polygon contains that point. Useful for geo-fencing, delivery zone determination, and sales territory assignment.

Layout variants:

  • POLYGON_SIMPLE: Linear scan through all polygons (slowest)
  • POLYGON_INDEX_EACH: Separate index per polygon with recursive grid
  • POLYGON_INDEX_CELL (the default used by plain POLYGON): Grid cells with indexed polygon pieces (fastest)

Data format: Polygons as Array(Array(Tuple(Float64, Float64))) where first ring is outer boundary and subsequent rings are holes.

CREATE DICTIONARY region_dict (
    polygon Array(Array(Tuple(Float64, Float64))),
    region_id UInt32,
    region_name String
)
PRIMARY KEY polygon
SOURCE(CLICKHOUSE(TABLE 'region_polygons'))
LAYOUT(POLYGON(STORE_POLYGON_KEY_COLUMN 1))
LIFETIME(3600);

-- Find region containing point
SELECT dictGet('region_dict', 'region_name', tuple(5.0, 5.0));

Dictionary layout selection decision tree

Is key type UInt64-compatible?
β”œβ”€β”€ YES: Is key space dense and < 500K entries?
β”‚   β”œβ”€β”€ YES β†’ FLAT (fastest, O(1) lookups)
β”‚   └── NO: How many attributes?
β”‚       β”œβ”€β”€ 1-2 β†’ HASHED
β”‚       β”œβ”€β”€ 3+ β†’ HASHED_ARRAY
β”‚       └── Memory constrained? β†’ SPARSE_HASHED
β”‚
└── NO (String or composite key):
    └── How many attributes?
        β”œβ”€β”€ 1-2 β†’ COMPLEX_KEY_HASHED
        β”œβ”€β”€ 3+ β†’ COMPLEX_KEY_HASHED_ARRAY
        └── Memory constrained? β†’ COMPLEX_KEY_SPARSE_HASHED

Special requirements:
β”œβ”€β”€ IP lookups β†’ IP_TRIE
β”œβ”€β”€ Point-in-polygon β†’ POLYGON
β”œβ”€β”€ Time/range data β†’ RANGE_HASHED
β”œβ”€β”€ Cannot fit in RAM β†’ CACHE or SSD_CACHE
β”œβ”€β”€ Need real-time source data β†’ DIRECT
└── Regex pattern matching β†’ REGEXP_TREE
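
The decision tree above can be encoded as a small helper function (illustrative only; a real layout choice should also weigh benchmarks on your own data):

```python
# Mirrors the layout decision tree: key type first, then density,
# attribute count, and memory pressure.
def pick_layout(uint64_key: bool, dense_under_500k: bool = False,
                attrs: int = 1, memory_constrained: bool = False) -> str:
    if uint64_key:
        if dense_under_500k:
            return "FLAT"
        if memory_constrained:
            return "SPARSE_HASHED"
        return "HASHED" if attrs <= 2 else "HASHED_ARRAY"
    # String or composite keys need the COMPLEX_KEY family
    if memory_constrained:
        return "COMPLEX_KEY_SPARSE_HASHED"
    return "COMPLEX_KEY_HASHED" if attrs <= 2 else "COMPLEX_KEY_HASHED_ARRAY"

print(pick_layout(True, dense_under_500k=True))  # FLAT
print(pick_layout(True, attrs=4))                # HASHED_ARRAY
print(pick_layout(False, attrs=1))               # COMPLEX_KEY_HASHED
```

Special requirements (IP_TRIE, POLYGON, RANGE_HASHED, CACHE, DIRECT, REGEXP_TREE) override this tree and are dictated by the data shape rather than the key type.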

Dictionary sources: Complete reference

ClickHouse source

Load from local or remote ClickHouse tables with optional custom queries and change detection.

SOURCE(CLICKHOUSE(
    HOST 'localhost'
    PORT 9000
    USER 'default'
    PASSWORD ''
    DB 'default'
    TABLE 'source_table'
    WHERE 'active = 1'
    INVALIDATE_QUERY 'SELECT max(updated_at) FROM source_table'
))

-- Or with custom query (cannot combine with TABLE/WHERE)
SOURCE(CLICKHOUSE(
    QUERY 'SELECT u.id, u.username, d.name as dept 
           FROM users u JOIN departments d ON u.dept_id = d.id'
))

MySQL source with replica support

SOURCE(MYSQL(
    PORT 3306
    USER 'clickhouse'
    PASSWORD 'secret'
    DB 'database_name'
    TABLE 'table_name'
    WHERE 'is_active = 1'
    INVALIDATE_QUERY 'SELECT max(updated_at) FROM table_name'
))

Multiple replicas can be specified in XML with priority-based failover.
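
A sketch of such an XML source, with replicas tried in priority order (hostnames here are placeholders):

```xml
<source>
    <mysql>
        <port>3306</port>
        <user>clickhouse</user>
        <password>secret</password>
        <replica>
            <host>mysql-primary</host>
            <priority>1</priority>
        </replica>
        <replica>
            <host>mysql-standby</host>
            <priority>2</priority>
        </replica>
        <db>database_name</db>
        <table>table_name</table>
    </mysql>
</source>
```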

PostgreSQL source

SOURCE(POSTGRESQL(
    HOST 'postgres-server'
    PORT 5432
    USER 'readonly'
    PASSWORD 'secret'
    DB 'database_name'
    TABLE 'table_name'
    INVALIDATE_QUERY 'SELECT max(updated_at) FROM table_name'
))

HTTP source for APIs and remote data

SOURCE(HTTP(
    URL 'https://example.com/data.csv'
    FORMAT 'CSVWithNames'
))

XML configuration supports credentials and custom headers for authentication.
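
A sketch of an HTTP source in XML with basic credentials and a custom header (URL, user, and token values are placeholders):

```xml
<source>
    <http>
        <url>https://example.com/data.csv</url>
        <format>CSVWithNames</format>
        <credentials>
            <user>api_user</user>
            <password>secret</password>
        </credentials>
        <headers>
            <header>
                <name>Authorization</name>
                <value>Bearer my-token</value>
            </header>
        </headers>
    </http>
</source>
```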

MongoDB source

SOURCE(MONGODB(
    HOST 'localhost'
    PORT 27017
    USER 'user'
    PASSWORD 'pass'
    DB 'database'
    COLLECTION 'collection_name'
    OPTIONS 'ssl=true&replicaSet=rs0'
))

-- Or connection URI
SOURCE(MONGODB(
    URI 'mongodb://user:password@localhost:27017/database'
    COLLECTION 'collection_name'
))

Redis source

SOURCE(REDIS(
    HOST 'redis-master'
    PORT 6379
    STORAGE_TYPE 'simple'    -- or 'hash_map' for two keys
    DB_INDEX 0
))

File source for local data

SOURCE(FILE(
    PATH '/var/lib/clickhouse/user_files/data.csv'
    FORMAT 'CSVWithNames'
))

Security note: DDL-created file dictionaries must reference files within the user_files directory.

Executable source for custom scripts

XML-only configuration (security restriction prevents DDL creation):

<source>
    <executable>
        <command>/opt/scripts/lookup.py</command>
        <format>JSONEachRow</format>
        <execute_direct>1</execute_direct>
    </executable>
</source>

Executable Pool maintains persistent processes for better performanceβ€”~2x faster than spawning per request.
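
A minimal sketch of such a script, assuming the JSONEachRow format from the configuration above and a key column named id (the column name and stub lookup are hypothetical). ClickHouse writes requested key rows to the script's stdin and reads enriched rows back from stdout:

```python
import json

def enrich(line: str) -> str:
    """Turn one JSONEachRow key row from ClickHouse into an enriched row."""
    row = json.loads(line)
    row["value"] = f"value_{row['id']}"  # stub; a real script queries a backing store
    return json.dumps(row)

# In the deployed script, ClickHouse pipes key rows via stdin:
#   import sys
#   for line in sys.stdin:
#       print(enrich(line), flush=True)

print(enrich('{"id": 7}'))  # {"id": 7, "value": "value_7"}
```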


Dictionary lifecycle and update management

LIFETIME configuration strategies

-- Never update (static reference data)
LIFETIME(0)

-- Fixed interval
LIFETIME(300)  -- Every 5 minutes

-- Random interval (distributes cluster load)
LIFETIME(MIN 300 MAX 360)  -- Random between 5-6 minutes

Random intervals prevent all cluster nodes from updating simultaneously, avoiding source overload spikes.

Update mechanics

During updates, the old dictionary version remains available for queries—no blocking occurs except during the initial load. If an update fails, the error is logged and the old version continues serving queries. Memory temporarily doubles during reload (old + new versions).

Incremental updates with update_field

For large dictionaries, UPDATE_FIELD enables incremental loading of only changed rows:

CREATE DICTIONARY incremental_dict (
    id UInt64,
    value String,
    updated_at DateTime
)
PRIMARY KEY id
SOURCE(CLICKHOUSE(
    TABLE 'source_table'
    UPDATE_FIELD updated_at  -- Only fetch rows where updated_at >= last_update
))
LAYOUT(HASHED())
LIFETIME(MIN 300 MAX 360);

Supported for FLAT, HASHED, HASHED_ARRAY, and COMPLEX_KEY_HASHED layouts with ClickHouse, MySQL, PostgreSQL, HTTP, and Executable sources.

Change detection with invalidate_query

SOURCE(CLICKHOUSE(
    TABLE 'source_table'
    INVALIDATE_QUERY 'SELECT max(updated_at) FROM source_table'
))

ClickHouse compares the query result between refresh cycles—the dictionary reloads only when the result changes, avoiding unnecessary full reloads.


Monitoring dictionaries in production

Primary monitoring query

SELECT 
    database,
    name,
    status,
    formatReadableSize(bytes_allocated) AS memory_size,
    element_count,
    load_factor,
    query_count,
    hit_rate,
    found_rate,
    loading_duration,
    last_successful_update_time,
    last_exception
FROM system.dictionaries
FORMAT Vertical;

Key metrics and alert thresholds

Metric              Description              Alert Condition
------------------  -----------------------  ------------------------
status              Load state               != 'LOADED'
bytes_allocated     Memory consumption       > expected size
hit_rate            Cache hit ratio          < 0.9 for cache layouts
found_rate          Keys found percentage    Dropping trend
loading_duration    Reload time              Increasing trend
last_exception      Most recent error        Non-empty

Comprehensive health check

SELECT 
    name,
    status,
    CASE 
        WHEN status = 'LOADED' THEN 'OK'
        WHEN status = 'LOADING' THEN 'LOADING'
        ELSE 'ALERT'
    END AS health,
    formatReadableSize(bytes_allocated) AS memory,
    element_count,
    round(hit_rate * 100, 2) AS hit_rate_pct,
    toDateTime(last_successful_update_time) AS last_update,
    dateDiff('minute', last_successful_update_time, now()) AS minutes_since_update,
    CASE 
        WHEN lifetime_max > 0 AND dateDiff('second', last_successful_update_time, now()) > lifetime_max * 2 
        THEN 'STALE' ELSE 'FRESH'
    END AS freshness,
    last_exception
FROM system.dictionaries
ORDER BY health ASC, bytes_allocated DESC;  -- 'ALERT' rows surface first

Cluster-wide monitoring

SELECT hostName() AS host, name, status, element_count, bytes_allocated
FROM clusterAllReplicas('default', system.dictionaries)
WHERE status != 'LOADED';

Dictionaries versus JOINs: When to use each

Use dictionaries when

  • Reference data fits in memory (or acceptable cache hit rate)
  • Lookups are frequent and performance-critical
  • Data is relatively static (periodic refresh acceptable)
  • You need LEFT ANY JOIN semantics (one match per key)
  • Enriching data during INSERT operations
  • Key is UInt64-compatible or simple composite type

Use traditional JOINs when

  • Full JOIN semantics required (INNER, FULL, ANTI, SEMI)
  • Right-side data changes frequently mid-query
  • Data is too large for memory with poor cache hit rates
  • Multiple matches per key needed (not ANY semantics)
  • Complex join conditions beyond simple key equality

Automatic conversion

Since ClickHouse 20.4, JOINs against dictionary-backed tables automatically convert to efficient dictionary lookups:

-- This JOIN against a dictionary table...
SELECT e.*, d.name
FROM events e
LEFT ANY JOIN my_dict d ON e.id = d.id

-- ...automatically becomes dictionary lookups, no manual dictGet needed

Hierarchical dictionaries for tree structures

Mark an attribute as HIERARCHICAL to enable parent-child relationships:

CREATE DICTIONARY org_dict (
    id UInt64,
    parent_id UInt64 HIERARCHICAL,  -- Establishes hierarchy
    name String
)
PRIMARY KEY id
SOURCE(CLICKHOUSE(TABLE 'org_hierarchy'))
LAYOUT(HASHED())
LIFETIME(300);

-- Get ancestor chain (bottom-up)
SELECT dictGetHierarchy('org_dict', 5);  -- Returns [5, 2, 1]

-- Check if descendant of ancestor
SELECT dictIsIn('org_dict', 5, 1);  -- Returns 1 if 5 is under 1

-- Get direct children
SELECT dictGetChildren('org_dict', 1);  -- Returns [2, 3]

-- Get all descendants (0 = unlimited depth)
SELECT dictGetDescendants('org_dict', 1, 0);
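
The bracketed results above assume source rows along these lines (illustrative data, with 0 as the root's parent):

```sql
-- Sample org_hierarchy rows consistent with the lookups above
INSERT INTO org_hierarchy (id, parent_id, name) VALUES
    (1, 0, 'Company'),
    (2, 1, 'Engineering'),
    (3, 1, 'Sales'),
    (5, 2, 'Backend Team');
```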

Use cases include organizational hierarchies, geographic regions (city β†’ state β†’ country), product category trees, and account parent-child relationships.


Common pitfalls and how to avoid them

Memory exhaustion during updates: Dictionary memory doubles temporarily during refresh (old + new versions). Budget for 2x peak dictionary memory.

LIFETIME too short: Causes excessive source load. Use invalidate_query instead of frequent full reloads, or implement UPDATE_FIELD for incremental updates.

LIFETIME too long: Causes stale data. Balance freshness requirements against source load; consider incremental updates for near-real-time needs.

All cluster nodes updating simultaneously: Use MIN/MAX range (LIFETIME(MIN 300 MAX 360)) to distribute updates randomly across nodes.

Key type mismatches: Always cast keys explicitly: dictGet('dict', 'value', toUInt64(id)). Using wrong types silently returns default values.

Missing PASSWORD in source: Causes authentication failures after restart. Always include PASSWORD even if empty: PASSWORD ''.

Using dictGet for filtering large tables: Dictionary lookup per row is expensive. Use indexed columns for filtering, dictGet only for final result enrichment:

-- SLOW: Dictionary lookup on every row
SELECT * FROM events 
WHERE dictGet('geo_dict', 'country', ip) = 'US'

-- FAST: Filter first, then enrich
SELECT *, dictGet('geo_dict', 'country', ip) AS country
FROM events
WHERE ip IN (SELECT ip FROM geo_table WHERE country = 'US')

Production use cases and real-world examples

E-commerce product enrichment: Product catalogs with name, category, price loaded from MySQL, enriching order streams without JOINs.

User profile enrichment: User attributes (country, segment, registration date) added to event streams at query time.

Geo-IP lookups: IP_TRIE dictionaries mapping IP addresses to geographic locations for real-time analytics with 3 million lookups/second performance.

ML feature stores: Low-latency feature retrieval for real-time scoring, loading pre-computed features from ClickHouse or external feature stores.

Configuration management: Dynamic application configuration from HTTP APIs or databases, updating without query redeploys.

Historical pricing: RANGE_HASHED dictionaries for point-in-time price lookups across millions of price changes.


Dictionary functions quick reference

-- Basic retrieval
dictGet('dict', 'attr', key)
dictGet('dict', ('attr1', 'attr2'), key)  -- Multiple attributes

-- With default fallback
dictGetOrDefault('dict', 'attr', key, 'fallback_value')

-- Returns NULL if missing
dictGetOrNull('dict', 'attr', key)

-- Check key existence
dictHas('dict', key)

-- Hierarchical functions
dictGetHierarchy('dict', key)           -- Ancestor chain
dictIsIn('dict', child, ancestor)       -- Descendant check
dictGetChildren('dict', key)            -- Direct children
dictGetDescendants('dict', key, depth)  -- All descendants

-- Range dictionary lookups (include date parameter)
dictGet('range_dict', 'attr', key, toDate('2024-01-15'))

Cluster deployment best practices

DDL-based management (recommended over XML):

CREATE DICTIONARY my_dict ON CLUSTER 'cluster_name' (...)
SYSTEM RELOAD DICTIONARY my_dict ON CLUSTER 'cluster_name'

Shared source configuration: Point all nodes to the same external source (MySQL, PostgreSQL) for automatic consistency.

Testing before production:

  1. Create in staging environment
  2. Validate data: SELECT count() FROM my_dict
  3. Check memory: SELECT bytes_allocated FROM system.dictionaries WHERE name='my_dict'
  4. Load test: SELECT dictGet('my_dict', 'attr', number) FROM system.numbers LIMIT 1000000 FORMAT Null

Rollback strategy: Create new dictionary with different name, test thoroughly, update queries to use new dictionary, drop old dictionary after validation.


Conclusion

ClickHouse Dictionaries represent a fundamental optimization pattern for analytical workloads that require frequent reference data lookups. The 25x performance improvement over traditional JOINs, combined with direct integration to external databases and automatic refresh capabilities, makes dictionaries essential for production ClickHouse deployments handling dimensional enrichment.

Key implementation decisions center on layout selection (FLAT for small sequential keys, HASHED family for general use, CACHE for memory constraints), source configuration (matching external systems to available connectors), and lifecycle management (balancing freshness via LIFETIME against source load using invalidate queries and incremental updates).

Production success requires monitoring dictionary status, memory consumption, and hit rates through system.dictionaries, along with cluster-wide deployment using ON CLUSTER DDL or shared source configurations. The most common pitfallsβ€”memory exhaustion during updates, key type mismatches, and inefficient filtering patternsβ€”are preventable through proper capacity planning and query design patterns.

For new implementations, start with DDL-based dictionary definitions, HASHED_ARRAY layout for multi-attribute lookups, and LIFETIME with MIN/MAX ranges. Evolve toward specialized layouts (IP_TRIE, RANGE_HASHED, POLYGON) as use cases demand, and implement comprehensive monitoring from day one.