CDC is an important part of data processing. Using CDC you can achieve many goals, from simple data replication to auditing and complex ETL jobs. But implementing CDC is still a tough task (especially when you consider more than the happy path). In this article I want to show different ways of implementing CDC. I'll try not only to show how to set up each variant, but also to compare them with each other and highlight the pros and cons of each option.
Prerequisites
Before we start, let’s look at the required environment. Most examples can be reproduced using a single Postgres instance without any additional setup, which may look like this:
```yaml
services:
  database:
    image: postgres:17
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -d postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 10s
    environment:
      POSTGRES_PASSWORD: postgres
      POSTGRES_USER: postgres
      POSTGRES_DB: postgres
```
If a CDC option requires additional components, this will be highlighted.
Change Data Capture (CDC)
A definition of CDC was given in one of the previous articles, but for clarity let’s briefly repeat it here. Change Data Capture (CDC) is a technique used in data integration and database systems to identify, track, and deliver only the data that has changed since the last capture. This enables efficient, real-time or near-real-time data synchronization, replication, and analytics without processing entire datasets repeatedly.
CDC techniques
As we already know, CDC is a technique for handling data changes. Depending on the task and its requirements, you may or may not want to track every change of every row. This is an important point: if you use it correctly, you can achieve your goal with a simple solution. Looking ahead a little — why set up a complex solution that tracks every insert, update, and delete if you know that your data is strictly append-only and its volume is not too high?
Select-based
The so-called select-based CDC is one of the simplest methods to implement. Under the hood it requires:
- A table with a monotonically increasing field
- A query that fetches updates based on that field
What does monotonically increasing mean? In simple words, the value of such a field should never decrease – each new value should be equal to or bigger than the previous one. More formally: for all `a` and `b`, if `a < b`, then `f(a) <= f(b)`.
Why is this important? Without such a field we can’t identify which rows were updated and consequently can’t fetch and process them.
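The formal definition above can be illustrated with a tiny Python sketch (a hypothetical helper for illustration only, not part of any CDC tooling):

```python
def is_monotonically_increasing(values):
    """Check that each value is greater than or equal to the previous one."""
    return all(a <= b for a, b in zip(values, values[1:]))

# An updated_at-style sequence: it never decreases, duplicates are allowed.
assert is_monotonically_increasing([1, 2, 2, 5])

# A decreasing step breaks the property: checkpointing on such a field
# would silently skip the row that "went backwards".
assert not is_monotonically_increasing([1, 3, 2])
```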
Consider this table:
```sql
CREATE TABLE data (
    id         UUID PRIMARY KEY,
    value      TEXT,
    updated_at TIMESTAMP WITH TIME ZONE NOT NULL
);
```
In this table, the `updated_at` field will serve as our monotonically increasing field. Updates can be fetched with the following simple query:
```sql
SELECT id, value, updated_at FROM data WHERE updated_at > {last processed point};
```
Here, `last processed point` is the maximum value of `updated_at` from the last processed batch.
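The polling loop around this query can be sketched in Python. This is a minimal, illustrative sketch: `poll_changes` is a hypothetical helper that operates on already-fetched rows (standing in for the result of the `SELECT` above), not part of any driver or library.

```python
def poll_changes(rows, last_processed_point):
    """One iteration of a select-based CDC poll.

    `rows` stands in for the result of:
        SELECT id, value, updated_at FROM data
        WHERE updated_at > :last_processed_point
    and is assumed to be a list of dicts with an "updated_at" key.
    Returns the changed rows and the advanced checkpoint.
    """
    changed = [r for r in rows if r["updated_at"] > last_processed_point]
    if not changed:
        # Nothing new: keep the old checkpoint so the next poll repeats safely.
        return [], last_processed_point
    new_checkpoint = max(r["updated_at"] for r in changed)
    return changed, new_checkpoint

# Usage: timestamps are modeled as plain numbers for brevity.
rows = [{"id": 1, "updated_at": 10}, {"id": 2, "updated_at": 5}]
changed, checkpoint = poll_changes(rows, 5)
# Only the row with updated_at = 10 passes the filter; checkpoint becomes 10
# and is used as the lower bound of the next poll.
```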
The problem with this method is that not all events can be handled:
- `DELETE` events can’t be handled, because the `SELECT` query returns only existing rows. Technically you can find out which rows were deleted, but that would require fetching all data from the source and comparing it with a saved snapshot in the target.
- Not all `UPDATE` events can be handled. If a single row was updated multiple times between two fetches, only the last version of the row will be captured.
- Some events may even be lost. Even though we use `updated_at` in our example, there may be (and will be) cases where multiple updates happen in different transactions, but our `SELECT` query sees only the committed ones. A long-running transaction can later commit a row whose `updated_at` is earlier than a checkpoint we have already advanced past.
There are other drawbacks to this method as well:
- Query overhead. Even with an index on `updated_at` (as in the example above), the query may still need to fetch data from disk rather than from shared buffers, which increases IO.
- Schema evolution is not automated. If the table schema changes, the query that fetches updates must also be updated (unless `*` is used).
- You may need to take an initial snapshot, which requires a full table scan.
In the end, despite being quite simple to implement, this method has crucial drawbacks. Still, it may be good enough for some simple cases.
Listen & notify
Postgres has an interesting feature that can also be used to implement CDC: the LISTEN/NOTIFY mechanism. It allows you to implement an almost-PubSub (with limitations, of course, but still).
For our example this table may be used:
```sql
CREATE TABLE data
(
    id    BIGSERIAL PRIMARY KEY,
    value TEXT
);
```
After the table is created, we need to declare a trigger and a function that publishes events to a channel:
```sql
CREATE OR REPLACE FUNCTION notify_table_change() RETURNS TRIGGER AS
$$
DECLARE
    payload TEXT;
BEGIN
    IF tg_op = 'INSERT' THEN
        payload := JSON_BUILD_OBJECT('op', 'INSERT', 'id', new.id, 'value', new.value)::TEXT;
    ELSIF tg_op = 'UPDATE' THEN
        payload := JSON_BUILD_OBJECT('op', 'UPDATE', 'id', new.id, 'value', new.value)::TEXT;
    ELSIF tg_op = 'DELETE' THEN
        payload := JSON_BUILD_OBJECT('op', 'DELETE', 'id', old.id, 'value', old.value)::TEXT;
    END IF;
    PERFORM pg_notify('my_channel', payload);
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER data_notify_trigger
    AFTER INSERT OR UPDATE OR DELETE
    ON data
    FOR EACH ROW
EXECUTE FUNCTION notify_table_change();
```
After that, we can check that events are generated:
```sql
INSERT INTO data(value) VALUES ('test');
```
And then receive them:
```
psql -h localhost -p 5432 -d postgres -U postgres

postgres=# LISTEN my_channel;
LISTEN
Asynchronous notification "my_channel" with payload "{"op" : "INSERT", "id" : 1, "value" : "test"}" received from server process with PID 68.
```
In a real application, a Postgres driver with LISTEN/NOTIFY support should be used, such as pgjdbc-ng.
A short example can be found in this application.
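A consumer-side sketch of this mechanism in Python might look as follows. It uses psycopg2’s notification API; the function names and the print-based handler are illustrative, and the connection loop requires a live Postgres, so only the payload parser runs standalone.

```python
import json
import select

def parse_notification(payload: str) -> dict:
    """Turn the JSON payload built by notify_table_change() into a dict."""
    event = json.loads(payload)
    if event.get("op") not in {"INSERT", "UPDATE", "DELETE"}:
        raise ValueError(f"unexpected operation: {event.get('op')}")
    return event

def listen_forever(conn, channel: str = "my_channel"):
    """Block on the connection socket and handle notifications as they arrive."""
    conn.autocommit = True  # LISTEN must not sit inside an open transaction
    with conn.cursor() as cur:
        cur.execute(f"LISTEN {channel};")
    while True:
        # Wait until the server socket becomes readable, then drain notifications.
        if select.select([conn], [], [], 5.0) == ([], [], []):
            continue  # timeout: loop again (a good place for a health check)
        conn.poll()
        while conn.notifies:
            notify = conn.notifies.pop(0)
            event = parse_notification(notify.payload)
            print(event["op"], event.get("id"))

# Usage (requires a running Postgres and the psycopg2 package):
#   import psycopg2
#   listen_forever(psycopg2.connect("host=localhost user=postgres password=postgres"))
```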
This method has the following limitations:
- First of all, the event payload must be constructed manually.
- Second, and following from the first: schema evolution, again. You have to update the logic of your event-builder function whenever fields are added or removed.
- Not all Postgres drivers support the LISTEN/NOTIFY mechanism. Of course, you can execute `LISTEN`/`NOTIFY` queries directly, but in most cases some kind of PubSub is required, and for that special drivers are needed.
- The size of an event payload is limited to about 8 KB, and as of now this cannot be changed.
- Postgres uses a single queue shared by all channels, limited to 8 GB. If the queue is full, no new events can be emitted.
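The payload-size limit is worth guarding against on the producer side. Below is a sketch of one common workaround: if the serialized event is too large, send only the row id and let the consumer re-fetch the full row. The constant and function name are assumptions for illustration, not values read from the server.

```python
import json

# pg_notify payloads are limited to slightly under 8 KB; this constant is a
# conservative assumption, not a value queried from Postgres.
MAX_NOTIFY_PAYLOAD_BYTES = 7900

def build_compact_event(op: str, row: dict) -> str:
    """Serialize an event, falling back to an id-only payload if it is too large.

    The consumer is then expected to re-fetch the full row by id — a common
    workaround for the NOTIFY payload limit.
    """
    payload = json.dumps({"op": op, **row})
    if len(payload.encode("utf-8")) <= MAX_NOTIFY_PAYLOAD_BYTES:
        return payload
    return json.dumps({"op": op, "id": row["id"], "truncated": True})
```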
Trigger-based
Trigger-based CDC can be considered as an evolution of select-based.
The idea is simple: we track changes in the source table and write them into a dedicated change_log table using triggers.
Compared with pure select-based, this approach can capture INSERT, UPDATE, and DELETE events explicitly.
Let’s use these two tables:
```sql
CREATE TABLE data (
    id         BIGSERIAL PRIMARY KEY,
    value      TEXT NOT NULL,
    updated_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now()
);

CREATE TABLE data_change_log (
    event_id   BIGSERIAL PRIMARY KEY,
    op         TEXT NOT NULL,
    data_id    BIGINT NOT NULL,
    payload    JSONB NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now()
);
```
Now we can define the trigger function:
```sql
CREATE OR REPLACE FUNCTION write_data_change_log() RETURNS TRIGGER AS
$$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO data_change_log(op, data_id, payload)
        VALUES ('INSERT', NEW.id, to_jsonb(NEW));
        RETURN NEW;
    ELSIF TG_OP = 'UPDATE' THEN
        INSERT INTO data_change_log(op, data_id, payload)
        VALUES ('UPDATE', NEW.id, jsonb_build_object('old', to_jsonb(OLD), 'new', to_jsonb(NEW)));
        RETURN NEW;
    ELSIF TG_OP = 'DELETE' THEN
        INSERT INTO data_change_log(op, data_id, payload)
        VALUES ('DELETE', OLD.id, to_jsonb(OLD));
        RETURN OLD;
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER data_cdc_trigger
    AFTER INSERT OR UPDATE OR DELETE
    ON data
    FOR EACH ROW
EXECUTE FUNCTION write_data_change_log();
```
Let’s verify that changes are captured:
```sql
INSERT INTO data(value) VALUES ('first');
UPDATE data SET value = 'first-updated' WHERE id = 1;
DELETE FROM data WHERE id = 1;

SELECT event_id, op, data_id, payload
FROM data_change_log
ORDER BY event_id;
```
A consumer application can read new events by checkpoint:
```sql
SELECT event_id, op, data_id, payload
FROM data_change_log
WHERE event_id > {last_processed_event_id}
ORDER BY event_id
LIMIT 1000;
```
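The consumer side of this query can be sketched in Python. As before, this is an illustrative helper operating on in-memory events (a stand-in for rows fetched from `data_change_log`); the function name and structure are assumptions, not part of any library.

```python
def next_batch(events, last_processed_event_id, limit=1000):
    """Select the next batch from an in-memory stand-in for data_change_log.

    Mirrors the checkpoint query:
        SELECT ... FROM data_change_log
        WHERE event_id > :last_processed_event_id
        ORDER BY event_id LIMIT :limit
    Returns the batch and the new checkpoint to persist.
    """
    batch = sorted(
        (e for e in events if e["event_id"] > last_processed_event_id),
        key=lambda e: e["event_id"],
    )[:limit]
    checkpoint = batch[-1]["event_id"] if batch else last_processed_event_id
    return batch, checkpoint

# Usage: unlike the select-based variant, deletes and every intermediate
# update survive here, because each operation got its own change-log row.
events = [
    {"event_id": 3, "op": "DELETE"},
    {"event_id": 1, "op": "INSERT"},
    {"event_id": 2, "op": "UPDATE"},
]
batch, checkpoint = next_batch(events, 1)
# batch contains events 2 and 3 in order; checkpoint advances to 3.
```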
This approach is often used when:
- You need to capture deletes and full update history
- You can’t use WAL/logical replication in current environment
- You want explicit business-specific payload format
But there are also trade-offs:
- Trigger logic is custom code and must be maintained alongside schema changes
- Additional writes increase the load on the source database
- If the trigger fails, the original write transaction fails too
- Ordering is local to this table (global ordering across many tables is harder)
Log-based
For log-based CDC we need to slightly adjust the Postgres settings:
```yaml
services:
  database:
    image: postgres:17
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -d postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 10s
    environment:
      POSTGRES_PASSWORD: postgres
      POSTGRES_USER: postgres
      POSTGRES_DB: postgres
    command:
      - "postgres"
      - "-c"
      - "wal_level=logical"
      - "-c"
      - "max_wal_senders=5"
      - "-c"
      - "max_replication_slots=5"
```
In this approach we don’t read the data table directly, and we don’t create custom triggers. Instead, we read changes from the Write-Ahead Log (WAL) using logical replication.
Let’s create the source table, a logical replication slot, and a publication:
```sql
CREATE TABLE data (
    id    BIGSERIAL PRIMARY KEY,
    value TEXT NOT NULL
);

SELECT pg_create_logical_replication_slot('app_cdc_slot', 'pgoutput');
CREATE PUBLICATION app_cdc_publication FOR TABLE data;
```
To stream changes outside Postgres we can use the Debezium connector (usually via Kafka Connect). Note that, unlike the previous options, this one requires additional components: a Kafka Connect runtime with the Debezium Postgres connector and, usually, a Kafka broker.
```json
{
  "name": "postgres-cdc-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "database",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname": "postgres",
    "database.server.name": "app",
    "publication.name": "app_cdc_publication",
    "slot.name": "app_cdc_slot",
    "table.include.list": "public.data",
    "topic.prefix": "debezium"
  }
}
```
The connector can be registered with a simple HTTP call:
```shell
curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d @connector-config.json
```
After that, changes in the `data` table will appear in a topic like `debezium.public.data`:
```sql
INSERT INTO data(value) VALUES ('first');
UPDATE data SET value = 'first-updated' WHERE id = 1;
DELETE FROM data WHERE id = 1;
```
In this model events include metadata (operation type, transaction/order information, timestamps), so consumers can replay and process changes reliably.
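A consumer of these events typically starts by unpacking the Debezium envelope, whose payload carries `before`, `after`, `op`, and `source` fields (with op codes `c`, `u`, `d`, and `r` for snapshot reads). Below is a sketch of such a decoder; it assumes the default, non-flattened envelope and the illustrative function name is my own.

```python
import json

# Debezium op codes: "c" = create, "u" = update, "d" = delete, "r" = snapshot read.
OP_NAMES = {"c": "INSERT", "u": "UPDATE", "d": "DELETE", "r": "SNAPSHOT"}

def decode_debezium_event(message: str) -> dict:
    """Extract the useful parts of a Debezium change-event envelope.

    Assumes the default envelope with "before", "after", "op",
    and "source" fields inside "payload".
    """
    payload = json.loads(message).get("payload", {})
    return {
        "op": OP_NAMES.get(payload.get("op"), "UNKNOWN"),
        "before": payload.get("before"),
        "after": payload.get("after"),
        # The source block carries ordering metadata such as the WAL LSN.
        "lsn": (payload.get("source") or {}).get("lsn"),
    }

# Usage with a minimal update event:
sample = json.dumps({"payload": {
    "op": "u",
    "before": {"id": 1, "value": "first"},
    "after": {"id": 1, "value": "first-updated"},
    "source": {"lsn": 12345},
}})
decoded = decode_debezium_event(sample)
# decoded["op"] is "UPDATE"; both row versions and the LSN are available.
```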
Main advantages of this method:
- Low overhead for source tables (no polling queries, no custom triggers)
- Full stream of row-level changes, including `INSERT`, `UPDATE`, and `DELETE`
- Better ordering/consistency guarantees based on the database log sequence
- Schema evolution can be propagated automatically by the CDC platform
Important limitations and operational details:
- Additional infrastructure is required (at least a connector runtime and usually a broker)
- Replication slots should be monitored: if a consumer is down for a long time, WAL will grow
- The initial snapshot can be expensive for very large tables
- Operational complexity is higher than with the select-based and trigger-based approaches
In practice log-based CDC is the most common choice for near real-time integration between operational databases and streaming/analytics systems.
Conclusion
As we can see, there is no “best” CDC approach for all cases. The right choice depends on requirements for latency, consistency, infrastructure, and operational complexity.
Short practical guideline:
- Select-based is good for very simple and mostly append-only data flows where occasional event loss or collapsed updates are acceptable.
- Listen & notify is useful for lightweight near real-time notifications inside one Postgres-centric system, but it is limited by payload size and queue specifics.
- Trigger-based provides explicit control over event format and captures all operation types, but increases write-path load and maintenance effort.
- Log-based gives the most complete and scalable CDC stream, but requires additional components and operational maturity.
If your task is production-grade data integration between services or analytics platforms, log-based CDC is usually the default choice.
If your task is a small internal integration with minimal infrastructure, trigger-based or even select-based can be enough.