ClickHouse table design with Siphon
This document provides an overview of how to leverage Siphon, our PostgreSQL change data capture (CDC) tool, to design analytical features using PostgreSQL data in ClickHouse.
Glossary
- Organization: Higher-level container for projects and groups.
- Namespace: Can point to a Group, Sub-group, User namespace, or a Project namespace (which acts as a shadow record for a Project).
- Project: Represents a GitLab Project.
Architecture Overview
Siphon facilitates the movement of data from PostgreSQL to ClickHouse using a publish-subscribe model to ensure scalability and reliability.
- Siphon Producer: Monitors the PostgreSQL logical replication slot and publishes row-level changes (INSERT, UPDATE, DELETE) to NATS.
- NATS: Acts as the message broker, decoupling the producer from the consumers.
- Siphon Consumer: Subscribes to NATS topics and writes the incoming data into ClickHouse using the ReplacingMergeTree engine.
Requirements
This guide assumes that the following GitLab components are up and running:
- PostgreSQL with wal_level=logical configuration.
- ClickHouse
- NATS
- Siphon Producer and Siphon Consumer (for ClickHouse)
Current status: Siphon is only available for local development via GDK for database tables present in the main database. You should set up these dependencies with GDK following this guide. You can find help and discuss technical issues in the #f_siphon Slack channel.
Table Replication Example
With Siphon, you can replicate a complete PostgreSQL table to ClickHouse. This process requires at least two steps:
- ClickHouse table creation: Create the destination table in ClickHouse where Siphon replicates the data.
- Configuration change: Update Siphon configuration so it is aware of the table.
In this example, we are replicating the labels table to ClickHouse.
ClickHouse Table Creation
You can generate a ClickHouse database migration using a Rake task. This transforms the PostgreSQL schema for the specified table into a Siphon-compatible ClickHouse table definition.
bundle exec rails generate gitlab:click_house:siphon labels

Generated CREATE TABLE statement:
CREATE TABLE IF NOT EXISTS siphon_labels
(
id Int64,
title Nullable(String),
color Nullable(String),
project_id Nullable(Int64),
created_at Nullable(DateTime64(6, 'UTC')),
updated_at Nullable(DateTime64(6, 'UTC')),
template Nullable(Bool) DEFAULT false,
description Nullable(String),
description_html Nullable(String),
type Nullable(String),
group_id Nullable(Int64),
cached_markdown_version Nullable(Int64),
lock_on_merge Bool DEFAULT false,
archived Bool DEFAULT false,
organization_id Nullable(Int64),
_siphon_replicated_at DateTime64(6, 'UTC') DEFAULT now(),
_siphon_deleted Bool DEFAULT FALSE
)
ENGINE = ReplacingMergeTree(_siphon_replicated_at, _siphon_deleted)
PRIMARY KEY id

The generator includes all columns from the PostgreSQL table and adds two extra fields to track record updates (_siphon_replicated_at and _siphon_deleted). If you want to skip some columns from the ClickHouse schema, remove them from the generated CREATE TABLE statement; Siphon will only insert data for columns which are available in ClickHouse.
By default, the generator uses the same primary key as PostgreSQL. You may modify this to suit your querying needs, provided the following two rules are met:
- All primary key columns from the PostgreSQL table must be included in the ClickHouse primary key.
- You may add extra columns to the primary key definition, provided the values of those columns never change.
Examples of such columns:
- Sharding keys.
- Columns which are not modifiable from the application, for example the target_id and target_type values of a notes record.
Siphon and ClickHouse do not validate the correctness of the primary key configuration.
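For illustration, here is a hedged sketch of a hand-edited variant of the generated statement: two columns we assume are not needed for analytics (description_html and cached_markdown_version) are removed, and Siphon will simply skip them during replication.

CREATE TABLE IF NOT EXISTS siphon_labels
(
    id Int64,
    title Nullable(String),
    color Nullable(String),
    project_id Nullable(Int64),
    created_at Nullable(DateTime64(6, 'UTC')),
    updated_at Nullable(DateTime64(6, 'UTC')),
    template Nullable(Bool) DEFAULT false,
    description Nullable(String),
    type Nullable(String),
    group_id Nullable(Int64),
    lock_on_merge Bool DEFAULT false,
    archived Bool DEFAULT false,
    organization_id Nullable(Int64),
    _siphon_replicated_at DateTime64(6, 'UTC') DEFAULT now(),
    _siphon_deleted Bool DEFAULT FALSE
)
ENGINE = ReplacingMergeTree(_siphon_replicated_at, _siphon_deleted)
-- The PostgreSQL primary key (id) must remain part of the ClickHouse
-- primary key; immutable extra columns may be prepended, as the
-- traversal_path variant later in this document demonstrates.
PRIMARY KEY id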
Siphon Configuration
After the table is created, you should update your Siphon configuration in gdk.yml and add your table:
siphon:
enabled: true
tables:
- organizations
- namespaces
- projects
- labels

After this change, you can restart the Siphon-related processes in your GDK and eventually you should see data synchronized to your table.

gdk restart siphon-clickhouse-consumer siphon-producer-main-db

The updated configuration triggers Siphon to perform two actions:
- Initial Snapshot: Take a snapshot of the current data to populate the siphon_labels table.
- Continuous Replication: Capture all subsequent record changes in real time via logical replication and apply them to the siphon_labels table.
Depending on the table size, the initial data snapshot may take several minutes or hours to complete.
You can inspect the table from the Rails console:

ClickHouse::Client.select('SELECT * FROM siphon_labels', :main)

Or you can start the ClickHouse client and run the following query:

SELECT * FROM siphon_labels;

Hierarchy Denormalization Examples
One of the biggest challenges at GitLab is querying data in the Namespace/Group hierarchy in PostgreSQL. Often, this cannot be done efficiently, and a database query may time out when querying a larger data range.
There has been extensive research regarding hierarchy-based lookups in ClickHouse, and there are several approaches to making these possible with Siphon and ClickHouse.
In PostgreSQL, we have introduced a special, denormalized array column called traversal_ids. This contains all ancestor namespace IDs, which allows us to look up all entities in the hierarchy more efficiently. Hierarchy optimization in ClickHouse means that this denormalized column can be part of any table where there is a direct reference to a namespace or project.
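To make the optimization concrete, here is a hedged PostgreSQL sketch of how traversal_ids enables a hierarchy lookup (assuming gitlab-org's namespace ID is 9970, as in the examples below):

-- Self and descendants of namespace 9970: every namespace whose
-- denormalized ancestor array contains the gitlab-org ID.
SELECT count(*)
FROM namespaces
WHERE traversal_ids @> ARRAY[9970::bigint];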
traversal_path Structure
In ClickHouse, traversal_ids is transformed to traversal_path, a slash (/) separated string.
1/9970/../92345/

- First integer: organization_id (allows organization-level lookups).
- Second integer: Top-level namespace ID (in this example: gitlab-org).
- Last integer: Pointer to a Group (or subgroup) or to a ProjectNamespace (references a Project record).
How to Determine the traversal_path
Three ClickHouse dictionaries (hash map caches) were added to the database for quick traversal_path lookups.
Look up traversal_path for a namespaces.id value:
SELECT dictGet('namespace_traversal_paths_dict', 'traversal_path', 9970);

Look up traversal_path for a projects.id value:

SELECT dictGet('project_traversal_paths_dict', 'traversal_path', 278964); -- GitLab project id

Look up traversal_path for an organizations.id value:

SELECT dictGet('organization_traversal_paths_dict', 'traversal_path', 1); -- "main" organization id

These dictionaries allow mostly O(1) access for looking up traversal_path values. Keep in mind that the returned data is cached and may not always be consistent. See the Consistency Guarantees section later in this document.
Create a Hierarchy-Lookup Optimized Table
You can create a Siphon table that is optimized for hierarchical lookups with the following Rake task:
bundle exec rails generate gitlab:click_house:siphon labels --with-traversal-path

As a result, the generated table contains one extra column (traversal_path) with the dictionary lookup as the default value. Using the $table_name.yml table description file, the script will automatically detect the configured sharding keys and set the correct dictionary lookups.
Generated CREATE TABLE statement:
CREATE TABLE IF NOT EXISTS siphon_labels
(
id Int64,
title Nullable(String),
color Nullable(String),
project_id Nullable(Int64),
created_at Nullable(DateTime64(6, 'UTC')),
updated_at Nullable(DateTime64(6, 'UTC')),
template Nullable(Bool) DEFAULT false,
description Nullable(String),
description_html Nullable(String),
type Nullable(String),
group_id Nullable(Int64),
cached_markdown_version Nullable(Int64),
lock_on_merge Bool DEFAULT false,
archived Bool DEFAULT false,
organization_id Nullable(Int64),
traversal_path String DEFAULT multiIf(
coalesce(group_id, 0) != 0, dictGetOrDefault('namespace_traversal_paths_dict', 'traversal_path', group_id, '0/'),
coalesce(project_id, 0) != 0, dictGetOrDefault('project_traversal_paths_dict', 'traversal_path', project_id, '0/'),
coalesce(organization_id, 0) != 0, dictGetOrDefault('organization_traversal_paths_dict', 'traversal_path', organization_id, '0/'),
'0/'
),
_siphon_replicated_at DateTime64(6, 'UTC') DEFAULT now(),
_siphon_deleted Bool DEFAULT FALSE,
PROJECTION pg_pkey_ordered (
SELECT *
ORDER BY id
)
)
ENGINE = ReplacingMergeTree(_siphon_replicated_at, _siphon_deleted)
PRIMARY KEY (traversal_path, id)
SETTINGS deduplicate_merge_projection_mode = 'rebuild'

The primary key for the table is (traversal_path, id). Additionally, the generator adds an extra projection for looking up records via the original PostgreSQL primary keys (name: pg_pkey_ordered). This is needed for Siphon to quickly locate existing records when replicating DELETE statements.
The deduplicate_merge_projection_mode setting
For tables utilizing projections, we explicitly set deduplicate_merge_projection_mode to rebuild. This setting, automatically injected by our generator script, is critical for maintaining strict consistency between the parent table and its projections. During a merge, specifically when rows are being deduplicated, this mode instructs ClickHouse to discard the existing projection data and recalculate it from scratch using the newly deduplicated rows. Without this setting, the projection risks becoming stale or inconsistent with the source data.
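As a hedged illustration of the projection in use (the label ID is arbitrary), a point lookup by the PostgreSQL primary key can be served from pg_pkey_ordered instead of a scan ordered by (traversal_path, id):

-- ClickHouse's optimizer may read from the pg_pkey_ordered projection here,
-- because the filter matches the projection's ORDER BY id layout.
SELECT id, title
FROM siphon_labels
WHERE id = 42;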
Querying the Table
When querying hierarchy-optimized tables, it’s important to ensure that the traversal_path filter is part of the query and that duplicates are filtered out at query time.
Querying exact traversal_path match: count items in a Group
SELECT COUNT(*)
FROM (
SELECT
argMax(title, _siphon_replicated_at) AS title,
argMax(_siphon_deleted, _siphon_replicated_at) AS _siphon_deleted,
id,
traversal_path
FROM siphon_labels
WHERE
traversal_path = '1/9970/'
GROUP BY traversal_path, id
)
WHERE _siphon_deleted = false

If we only know the group_id, we can use a dictionary lookup as the filter:
SELECT COUNT(*)
FROM (
SELECT
argMax(title, _siphon_replicated_at) AS title,
argMax(_siphon_deleted, _siphon_replicated_at) AS _siphon_deleted,
id,
traversal_path
FROM siphon_labels
WHERE
traversal_path = dictGetOrDefault('namespace_traversal_paths_dict', 'traversal_path', 9970, '0/')
GROUP BY traversal_path, id
)
WHERE _siphon_deleted = false

Querying self and descendants: count items in a Group hierarchy
SELECT COUNT(*)
FROM (
SELECT
argMax(title, _siphon_replicated_at) AS title,
argMax(_siphon_deleted, _siphon_replicated_at) AS _siphon_deleted,
id,
traversal_path
FROM siphon_labels
WHERE
startsWith(traversal_path, '1/9970/')
GROUP BY traversal_path, id
)
WHERE _siphon_deleted = false

Deduplication at Query Time
Siphon tables use the ReplacingMergeTree table engine. Unlike PostgreSQL, ClickHouse does not enforce uniqueness constraints at the time of insertion for performance reasons. Instead, the engine merges rows with the same primary key in the background.
Because this merge is asynchronous, queries may return multiple versions of the same row. To handle this, always apply query-time deduplication:
- Group By the primary key.
- Use argMax(column, _siphon_replicated_at) to select the latest value.
- Filter out rows where _siphon_deleted is true.
When writing queries against Siphon tables, always think about how to deduplicate the results:
SELECT title, id FROM siphon_labels ORDER BY title LIMIT 5

The query above is not deduplicated, so it may return duplicate rows. One way of deduplicating the query is the following:
SELECT title, id
FROM (
SELECT
argMax(title, _siphon_replicated_at) AS title, -- take the latest `title` value
argMax(_siphon_deleted, _siphon_replicated_at) AS _siphon_deleted, -- take the latest deletion status
id,
traversal_path
FROM siphon_labels
GROUP BY traversal_path, id -- GROUP BY primary key
)
WHERE
_siphon_deleted = false -- Filter out deleted rows
ORDER BY title
LIMIT 5

If you are unsure which columns serve as the version and deleted columns, you can always inspect the table definition:

SHOW CREATE TABLE siphon_labels;

Find the ENGINE clause and look at the arguments:

ENGINE = ReplacingMergeTree(_siphon_replicated_at, _siphon_deleted)

Near Real-Time Denormalization via NULL Engine
The GitLab PostgreSQL schema is optimized for normalized, transactional workloads. However, ClickHouse performance scales best with denormalized (wide) tables. While ClickHouse supports JOINs, they are computationally expensive. As a rule of thumb, if a JOIN more than triples your query I/O, you should denormalize at ingestion time.
This strategy is ideal for enrichment where relationships are 1:1 or 1:N (where N is small, e.g., < 500 rows), such as adding Assignee and Reviewer data directly to a Merge Request record.
High-Level Overview
To avoid the cost of storing raw data twice, we utilize a “pass-through” pipeline:
- The Ingestion Point (Siphon Table): Define a “landing” table using the NULL engine. This engine acts as a bufferless pipe that receives data from Siphon and triggers downstream logic, immediately discarding the raw bytes to save disk space.
- The Storage Table (Destination): Define the final ReplacingMergeTree table. This schema includes both the source columns and the extra columns intended for denormalized data.
- The Logic Engine (Materialized View): Define a Materialized View (MV) that “watches” the NULL table. Upon every insert, the MV performs the JOIN logic against existing reference tables to fetch enrichment data.
- The Landing: The MV pushes the enriched, “wide” rows into the storage table.
Implementation Example: Denormalized Merge Request Reviewers
Architecture Overview:
graph TD
subgraph Ingestion_Source [Data Source]
CDC[Siphon / CDC Data]
end
subgraph ClickHouse_Engine [Pipeline]
NullTable[<b>siphon_merge_requests</b><br/>ENGINE = Null]
MV{<b>merge_requests_mv</b><br/>Materialized View}
RefTable[(<b>siphon_merge_request_reviewers</b><br/>ENGINE = ReplacingMergeTree)]
end
subgraph Storage [Destination]
FinalTable[(<b>merge_requests</b><br/>ENGINE = ReplacingMergeTree)]
end
%% Data Flow
CDC -->|INSERT| NullTable
NullTable -->|Triggers| MV
RefTable -.->|Lookup/JOIN/Format| MV
MV -->|Enriched Data| FinalTable
%% Styling
style NullTable fill:#f9f,stroke:#333,stroke-width:2px
style FinalTable fill:#00ff00,color:#000,stroke:#333,stroke-width:2px
style MV fill:#fff4dd,stroke:#d4a017,stroke-width:2px
Note: For the sake of simplicity, unnecessary columns are omitted from the merge_requests table.
1. Create the NULL siphon_merge_requests table
First, generate a Siphon table for merge_requests:
bundle exec rails generate gitlab:click_house:siphon merge_requests --with-traversal-path

The resulting schema (after removing the projection and non-essential columns) appears as follows:
CREATE TABLE IF NOT EXISTS siphon_merge_requests
(
id Int64 CODEC(DoubleDelta, ZSTD),
target_branch String,
source_branch String,
source_project_id Nullable(Int64),
author_id Nullable(Int64),
title Nullable(String),
created_at DateTime64(6, 'UTC') CODEC(Delta, ZSTD(1)),
updated_at DateTime64(6, 'UTC') CODEC(Delta, ZSTD(1)),
target_project_id Int64,
iid Nullable(Int64),
description Nullable(String),
traversal_path String DEFAULT multiIf(coalesce(target_project_id, 0) != 0, dictGetOrDefault('project_traversal_paths_dict', 'traversal_path', target_project_id, '0/'), '0/') CODEC(ZSTD(3)),
_siphon_replicated_at DateTime64(6, 'UTC') DEFAULT now() CODEC(ZSTD(1)),
_siphon_deleted Bool DEFAULT FALSE CODEC(ZSTD(1))
)
ENGINE = Null;

Key Notes:
- The ENGINE is explicitly set to Null.
- Column compression remains defined but has no performance impact, as data is never written to disk by this engine.
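A quick sanity check of the pass-through behavior (a hedged sketch; the values are arbitrary): inserts into the Null table succeed and trigger any attached materialized views, but the table itself stores nothing.

INSERT INTO siphon_merge_requests (id, target_branch, source_branch, target_project_id)
VALUES (1, 'main', 'feature', 12);

-- The Null engine discards the raw rows, so this always returns 0:
SELECT count() FROM siphon_merge_requests;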
2. Create the siphon_merge_request_reviewers table
Generate the reviewers table:
bundle exec rails generate gitlab:click_house:siphon merge_request_reviewers --with-traversal-path

The generated schema:
CREATE TABLE IF NOT EXISTS siphon_merge_request_reviewers
(
id Int64 CODEC(DoubleDelta, ZSTD),
user_id Int64,
merge_request_id Int64,
created_at DateTime64(6, 'UTC') CODEC(Delta, ZSTD(1)),
state Int8 DEFAULT 0,
project_id Int64,
traversal_path String DEFAULT multiIf(coalesce(project_id, 0) != 0, dictGetOrDefault('project_traversal_paths_dict', 'traversal_path', project_id, '0/'), '0/') CODEC(ZSTD(3)),
_siphon_replicated_at DateTime64(6, 'UTC') DEFAULT now() CODEC(ZSTD(1)),
_siphon_deleted Bool DEFAULT FALSE CODEC(ZSTD(1)),
PROJECTION pg_pkey_ordered (
SELECT *
ORDER BY id
)
)
ENGINE = ReplacingMergeTree(_siphon_replicated_at, _siphon_deleted)
PRIMARY KEY (traversal_path, merge_request_id, id)
SETTINGS deduplicate_merge_projection_mode = 'rebuild', index_granularity = 1024;

Modifications made:
- merge_request_id is included in the primary key to optimize reviewer lookups per MR.
- index_granularity is decreased to 1024 to improve JOIN performance during the MV trigger.
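To illustrate the primary key choice, here is a hedged sketch of a per-MR reviewer lookup (made-up path and ID; query-time deduplication omitted for brevity) that can prune granules using the first two key columns:

SELECT user_id, state
FROM siphon_merge_request_reviewers
WHERE traversal_path = '1/9970/'   -- hypothetical traversal_path
  AND merge_request_id = 1234;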
3. Create the merge_requests destination table
This table contains all columns from siphon_merge_requests plus additional denormalized columns (in this case, for reviewers). Reviewers are modeled as an array of Tuples, where each tuple contains the user_id and the review state.
CREATE TABLE IF NOT EXISTS merge_requests
(
id Int64 CODEC(DoubleDelta, ZSTD),
target_branch String,
source_branch String,
source_project_id Nullable(Int64),
author_id Nullable(Int64),
title Nullable(String),
created_at DateTime64(6, 'UTC') CODEC(Delta, ZSTD(1)),
updated_at DateTime64(6, 'UTC') CODEC(Delta, ZSTD(1)),
target_project_id Int64,
iid Nullable(Int64),
description Nullable(String),
traversal_path String DEFAULT multiIf(coalesce(target_project_id, 0) != 0, dictGetOrDefault('project_traversal_paths_dict', 'traversal_path', target_project_id, '0/'), '0/') CODEC(ZSTD(3)),
reviewers Array(Tuple(UInt64, Int8)),
_siphon_replicated_at DateTime64(6, 'UTC') DEFAULT now() CODEC(ZSTD(1)),
_siphon_deleted Bool DEFAULT FALSE CODEC(ZSTD(1)),
PROJECTION pg_pkey_ordered (
SELECT *
ORDER BY id
)
)
ENGINE = ReplacingMergeTree(_siphon_replicated_at, _siphon_deleted)
PRIMARY KEY (traversal_path, id)
SETTINGS deduplicate_merge_projection_mode = 'rebuild'

4. Create the JOIN Materialized View
The Materialized View ensures that every insert into siphon_merge_requests is enriched with current reviewer data.
CREATE MATERIALIZED VIEW merge_requests_mv TO merge_requests
AS WITH
cte AS
(
-- Store the current INSERT block for siphon_merge_requests table
SELECT *
FROM siphon_merge_requests
),
collected_reviewers AS
(
SELECT
traversal_path,
merge_request_id,
groupArray((user_id, state)) AS reviewers -- Build the Array(Tuple(UInt64, Int8)) value
FROM
(
-- Load the deduplicated reviewer records for the given merge request values
SELECT
traversal_path,
merge_request_id,
id,
argMax(user_id, _siphon_replicated_at) AS user_id,
argMax(state, _siphon_replicated_at) AS state,
argMax(_siphon_deleted, _siphon_replicated_at) AS _siphon_deleted
FROM siphon_merge_request_reviewers
WHERE (traversal_path, merge_request_id) IN (
SELECT traversal_path, id
FROM cte
)
GROUP BY traversal_path, merge_request_id, id
HAVING _siphon_deleted = false
)
GROUP BY traversal_path, merge_request_id
)
SELECT
cte.id AS id,
cte.target_branch AS target_branch,
cte.source_branch AS source_branch,
cte.source_project_id AS source_project_id,
cte.author_id AS author_id,
cte.title AS title,
cte.created_at AS created_at,
cte.updated_at AS updated_at,
cte.target_project_id AS target_project_id,
cte.iid AS iid,
cte.description AS description,
cte.traversal_path AS traversal_path,
collected_reviewers.reviewers AS reviewers,
cte._siphon_replicated_at AS _siphon_replicated_at,
cte._siphon_deleted AS _siphon_deleted
FROM cte
LEFT JOIN collected_reviewers ON collected_reviewers.merge_request_id = cte.id AND collected_reviewers.traversal_path = cte.traversal_path
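To see the pipeline end to end, here is a hedged sketch with made-up IDs: insert a reviewer first, then push the parent merge request through the Null table; the MV enriches the row on its way into merge_requests.

-- 1. A reviewer row for merge request 1000 (state defaults to 0):
INSERT INTO siphon_merge_request_reviewers (id, user_id, merge_request_id, project_id)
VALUES (1, 73, 1000, 12);

-- 2. The parent MR; this insert triggers merge_requests_mv:
INSERT INTO siphon_merge_requests (id, target_branch, source_branch, target_project_id, iid)
VALUES (1000, 'main', 'feature', 12, 1);

-- 3. The destination table now holds the enriched row, e.g. reviewers = [(73, 0)]
--    (query-time deduplication omitted for brevity):
SELECT id, reviewers FROM merge_requests WHERE id = 1000;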
Siphon Configuration and the “Chicken-and-Egg” Problem

Materialized Views trigger on inserts to the source table (siphon_merge_requests). However, if an update occurs in the siphon_merge_request_reviewers table, the denormalized merge_requests table will not automatically update.
Because Siphon replicates tables independently, transactional dependencies from PostgreSQL are not natively preserved. To resolve this, Siphon includes a “refresh” callback feature. When a downstream record (reviewer) changes, Siphon can trigger an asynchronous “refresh” of the parent record (Merge Request). This re-inserts the parent row into the Null table after a short delay, re-triggering the MV and ensuring it picks up the latest reviewer state.
This is conceptually similar to the .touch method in Ruby on Rails, executed asynchronously across a collection of records.
Siphon ClickHouse consumer configuration:
streams:
# Note: to properly work with hierarchy de-normalization, the following tables also need to be replicated: namespaces, projects, organizations
- identifier: merge_requests
subject: merge_requests
target: siphon_merge_requests
# required for handling refresh callbacks from downstream tables
enable_refresh_package: true
- identifier: merge_request_reviewers
subject: merge_request_reviewers
target: siphon_merge_request_reviewers
dedup_by:
- id
clickhouse:
# connectivity configuration is disabled
refresh_on_change:
- source_stream_identifier: merge_request_reviewers
source_keys:
- merge_request_id
target_stream_identifier: merge_requests
target_keys:
- id
# dedup_config is needed for properly applying DELETE and UPDATE events
dedup_config:
- stream_identifier: merge_requests
dedup_by_table: merge_requests
dedup_by:
- id
- stream_identifier: merge_request_reviewers
dedup_by:
- id

Querying the Table
After the configuration is finished and data is replicated to the ClickHouse tables, we can run a query on the merge_requests table. The following query counts Merge Requests for a project (id = 12) that are being reviewed by a specific user (id = 73) with a status of “requested changes” (state = 3).
SELECT COUNT(*)
FROM (
SELECT
id,
argMax(reviewers, _siphon_replicated_at) AS reviewers,
argMax(_siphon_deleted, _siphon_replicated_at) AS _siphon_deleted
FROM merge_requests
WHERE
-- Filter for the project
startsWith(traversal_path, dictGetOrDefault('project_traversal_paths_dict', 'traversal_path', 12, '0/'))
GROUP BY traversal_path, id
HAVING _siphon_deleted = false
)
WHERE
-- x.1: user id filter
-- x.2: review state filter
arrayExists(x -> x.1 = 73 AND x.2 = 3, reviewers);

Note: Array data types may add extra overhead during parsing and filtering. When denormalized data is a simple list of IDs without associated state, a delimited string field (e.g., '/user_id1/user_id2/') combined with hasSubstr can offer higher performance.
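As a hedged sketch of the string-based alternative (reviewer_ids is a hypothetical String column storing the IDs as '/1/73/85/'; plain substring search is used here for illustration), a substring filter replaces the array check:

-- Matches rows where user 73 appears in the delimited reviewer list;
-- the surrounding slashes prevent partial-ID matches (e.g., 730).
SELECT count()
FROM merge_requests
WHERE position(reviewer_ids, '/73/') > 0;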
Consistency Guarantees
With all denormalizations, there are trade-offs. The Siphon-based hierarchy denormalization provides the following consistency guarantees under normal operation (assuming that the namespaces and projects tables have finished their initial data snapshot):
- Records in hierarchy-denormalized tables will be inserted, updated, and deleted correctly even when the column on which the hierarchy lookup is based has changed (e.g., namespace_id was updated).
  - Note: If the hierarchy lookup column (namespace_id) is a sharding key, these columns never change in PostgreSQL by design.
- When one of the namespace record references in the traversal_path changes (e.g., a subgroup/project is moved or deleted), eventual consistency is enforced via periodic consistency check jobs. The system aims to resolve these consistency issues within 5 minutes for large tables (note: this is to be implemented within this issue).
In development, consistency issues may appear more often because record creation may happen very close to the project or group creation. In these cases, the eventual consistency enforcement should resolve the problems within seconds or minutes (configurable).
Working in Cells Environment
Siphon-replicated database tables support the Cells architecture by design. Assuming that Siphon is configured on all cells, when an organization or group is moved to another cell, no extra data migrations are needed for ClickHouse.
The source of truth is always PostgreSQL, which means that all record changes within the relocation process (organization movement) are replicated:
- Inserting data from Cell A to Cell B.
- Removing old/stale data from Cell A.
These changes are picked up by the Siphon producer via the logical replication stream and are eventually applied to ClickHouse by the Siphon consumer.