CDC is an important part of data processing. With CDC you can achieve many goals, from simple data replication to auditing and complex ETL jobs. But implementing CDC is still a tough task (especially once you consider more than the happy path). In this article I want to show different ways of implementing CDC. I’ll not only show how to set up each variant, but also compare them with each other and highlight the pros and cons of each option.

Prerequisite

Before we start, let’s look at the required environment. Most examples can be reproduced using a single Postgres instance without any additional setup, which may look like this:

services:
  database:
    image: postgres:17
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -d postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 10s
    environment:
      POSTGRES_PASSWORD: postgres
      POSTGRES_USER: postgres
      POSTGRES_DB: postgres

If a CDC option requires additional components, it will be highlighted.

Change Data Capture (CDC)

A definition of CDC was given in one of the previous articles, but for clarity let’s briefly repeat it here. Change Data Capture (CDC) is a technique used in data integration and database systems to identify, track, and deliver only the data that has changed since the last capture. This enables efficient, real-time or near-real-time data synchronization, replication, and analytics without processing entire datasets repeatedly.

CDC techniques

As we already know, CDC is a technique for handling data changes. Depending on the task and its requirements, you may or may not want to track every change of every row. This is an important point: used correctly, it lets you reach your goal with a simple solution (looking ahead a little: why set up a complex solution that tracks every insert, update, and delete if you know your data is strictly append-only and its volume is not too high?).

Select-based

The so-called select-based CDC is one of the simplest methods to implement. Under the hood it requires:

  • Table with monotonically increasing field
  • Query which will be used to fetch updates using monotonically increasing field

What does monotonically increasing mean? In simple words, it means that the value of such a field never decreases: each new value is equal to or bigger than the previous one. More formally: for all a and b, if a < b, then f(a) <= f(b). Why is this important? Without such a field we can’t find the values that were updated, and consequently can’t fetch and process those rows.
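The property is easy to express as a tiny check (a sketch; the sample values stand in for field values such as timestamps):

```python
def is_monotonically_increasing(values) -> bool:
    """True if each value is >= the previous one, i.e. the sequence never decreases."""
    return all(a <= b for a, b in zip(values, values[1:]))

# Values that only move forward (repeats allowed) satisfy the property...
print(is_monotonically_increasing([1, 2, 2, 5]))  # True
# ...while a value that goes backwards breaks it.
print(is_monotonically_increasing([1, 3, 2]))     # False
```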

Consider this table:

CREATE TABLE data (
  id UUID PRIMARY KEY,
  value TEXT,
  updated_at TIMESTAMP WITH TIME ZONE NOT NULL
);

In this table, updated_at will be our monotonically increasing field. Updates can be fetched using the following simple query:

SELECT id, value, updated_at FROM data WHERE updated_at > {last processed point};

Here {last processed point} is the maximum value of updated_at from the last processed batch.
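The polling loop can be sketched like this (a minimal in-memory simulation; in a real consumer the list below would be replaced by the SELECT query above, and the names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Row:
    id: str
    value: str
    updated_at: int  # a real implementation would use timestamps

def fetch_updates(table: list, last_processed: int) -> list:
    """Emulates: SELECT ... FROM data WHERE updated_at > {last processed point}."""
    return sorted((r for r in table if r.updated_at > last_processed),
                  key=lambda r: r.updated_at)

table = [Row("a", "v1", 10), Row("b", "v2", 20), Row("c", "v3", 30)]
checkpoint = 0
while True:
    batch = fetch_updates(table, checkpoint)
    if not batch:
        break  # a real poller would sleep and retry instead of stopping
    for row in batch:
        print(row.id, row.value)
    checkpoint = batch[-1].updated_at  # advance the checkpoint to the batch maximum
```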

The problem with this method is that not all events can be handled:

  • DELETE events can’t be handled, because the SELECT query returns only existing rows. Technically you can find out which rows were deleted, but that requires fetching all data from the source and comparing it with the saved snapshot in the target.
  • Not all UPDATE events can be handled. If a single row was updated multiple times between two fetches, only the last version of the row will be captured.
  • Some events may even be lost. Even though our example uses updated_at, there may be (and will be) cases where a transaction commits a row with an updated_at older than our checkpoint; since the SELECT query sees only committed data, such rows are skipped.

There are also other drawbacks to this method:

  • Query overhead. Even with an index on updated_at (as in the example above), the query may still need to read data not from shared buffers but from disk, which increases IO.
  • Schema evolution is not automated. If the table schema changes, the query for fetching updates also has to be updated (unless * is used).
  • You may need an initial snapshot, which requires a full table scan.

In the end, despite being quite simple to implement, this method has crucial drawbacks. Even so, in some simple cases it may still be enough.

Listen & notify

Postgres has an interesting feature which can also be used to implement CDC: the LISTEN/NOTIFY mechanism. It allows implementing an almost-PubSub (with limitations, of course, but still).

For our example this table may be used:

CREATE TABLE data
(
    id    BIGSERIAL PRIMARY KEY,
    value TEXT
);

After the table is created, we need to declare a trigger and a function that publishes events to a channel:

CREATE OR REPLACE FUNCTION notify_table_change() RETURNS TRIGGER AS
$$
DECLARE
    payload TEXT;
BEGIN
    IF tg_op = 'INSERT' THEN
        payload := JSON_BUILD_OBJECT('op', 'INSERT', 'id', new.id, 'value', new.value)::TEXT;
    ELSIF tg_op = 'UPDATE' THEN
        payload := JSON_BUILD_OBJECT('op', 'UPDATE', 'id', new.id, 'value', new.value)::TEXT;
    ELSIF tg_op = 'DELETE' THEN
        payload := JSON_BUILD_OBJECT('op', 'DELETE', 'id', old.id, 'value', old.value)::TEXT;
    END IF;
    PERFORM pg_notify('my_channel', payload);
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER data_notify_trigger
    AFTER INSERT OR UPDATE OR DELETE
    ON data
    FOR EACH ROW
EXECUTE FUNCTION notify_table_change();

After that, we can check that events are generated:

INSERT INTO data(value) VALUES ('test');

And then receive it:

psql -h localhost -p 5432 -d postgres -U postgres

postgres=# LISTEN my_channel;
LISTEN
Asynchronous notification "my_channel" with payload "{"op" : "INSERT", "id" : 1, "value" : "test"}" received from server process with PID 68.

In a real application a Postgres driver with LISTEN/NOTIFY support should be used, such as pgjdbc-ng.

A short example can be found in this application.
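In Python, for example, the psycopg2 driver supports asynchronous notifications. A consumer might look roughly like this (a sketch, not production code; the connection string follows the compose file above):

```python
import json
import select

def handle_notification(payload: str) -> dict:
    """Parse the JSON payload built by the notify_table_change() trigger."""
    event = json.loads(payload)
    if event.get("op") not in ("INSERT", "UPDATE", "DELETE"):
        raise ValueError(f"unexpected op: {event.get('op')!r}")
    return event

def run_listener(dsn: str) -> None:
    """Blocks forever, printing parsed events from the my_channel channel."""
    import psycopg2  # third-party driver with LISTEN/NOTIFY support
    conn = psycopg2.connect(dsn)
    conn.set_session(autocommit=True)  # notifications are delivered outside explicit transactions
    with conn.cursor() as cur:
        cur.execute("LISTEN my_channel;")
    while True:
        # Wait until the connection socket becomes readable, then drain pending events
        if select.select([conn], [], [], 5.0)[0]:
            conn.poll()
            while conn.notifies:
                note = conn.notifies.pop(0)
                print(handle_notification(note.payload))

# run_listener("host=localhost port=5432 dbname=postgres user=postgres password=postgres")
```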

This method has the following limitations:

  • First of all, the event payload has to be constructed manually.
  • Second, which follows from the first: again, schema evolution. You have to update the logic of your event-builder function whenever fields are added or removed.
  • Not all Postgres drivers support the listen/notify mechanism. Of course, you can execute LISTEN/NOTIFY queries manually, but in most cases some kind of PubSub is required, and for that special drivers are needed.
  • The size of an event payload is limited to ~8 KB, and for now this can’t be changed.
  • Postgres keeps a single queue shared by all channels, with an 8 GB limit. If the queue is full, no new events can be emitted.

Trigger-based

Trigger-based CDC can be considered as an evolution of select-based.

The idea is simple: we track changes in the source table and write them into a dedicated change-log table using triggers.
Compared with the pure select-based approach, this one captures INSERT, UPDATE, and DELETE events explicitly.

Let’s use these two tables:

CREATE TABLE data (
  id BIGSERIAL PRIMARY KEY,
  value TEXT NOT NULL,
  updated_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now()
);

CREATE TABLE data_change_log (
  event_id BIGSERIAL PRIMARY KEY,
  op TEXT NOT NULL,
  data_id BIGINT NOT NULL,
  payload JSONB NOT NULL,
  created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now()
);

Now we can define the trigger function:

CREATE OR REPLACE FUNCTION write_data_change_log() RETURNS TRIGGER AS
$$
BEGIN
  IF TG_OP = 'INSERT' THEN
    INSERT INTO data_change_log(op, data_id, payload)
    VALUES ('INSERT', NEW.id, to_jsonb(NEW));
    RETURN NEW;
  ELSIF TG_OP = 'UPDATE' THEN
    INSERT INTO data_change_log(op, data_id, payload)
    VALUES ('UPDATE', NEW.id, jsonb_build_object('old', to_jsonb(OLD), 'new', to_jsonb(NEW)));
    RETURN NEW;
  ELSIF TG_OP = 'DELETE' THEN
    INSERT INTO data_change_log(op, data_id, payload)
    VALUES ('DELETE', OLD.id, to_jsonb(OLD));
    RETURN OLD;
  END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER data_cdc_trigger
AFTER INSERT OR UPDATE OR DELETE
ON data
FOR EACH ROW
EXECUTE FUNCTION write_data_change_log();

Let’s verify that changes are captured:

INSERT INTO data(value) VALUES ('first');
UPDATE data SET value = 'first-updated' WHERE id = 1;
DELETE FROM data WHERE id = 1;

SELECT event_id, op, data_id, payload
FROM data_change_log
ORDER BY event_id;

A consumer application can read new events by checkpoint:

SELECT event_id, op, data_id, payload
FROM data_change_log
WHERE event_id > {last_processed_event_id}
ORDER BY event_id
LIMIT 1000;
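Consumed events can then be replayed into a target store. A minimal in-memory sketch of such a replay (the tuples mimic data_change_log rows; names are illustrative):

```python
def apply_event(target: dict, op: str, data_id: int, payload: dict) -> None:
    """Replays one data_change_log row into a dict keyed by data_id."""
    if op == "INSERT":
        target[data_id] = payload
    elif op == "UPDATE":
        target[data_id] = payload["new"]  # the trigger stores {'old': ..., 'new': ...}
    elif op == "DELETE":
        target.pop(data_id, None)
    else:
        raise ValueError(f"unknown op: {op!r}")

# Events as produced by the INSERT/UPDATE/DELETE statements above
events = [
    (1, "INSERT", 1, {"id": 1, "value": "first"}),
    (2, "UPDATE", 1, {"old": {"id": 1, "value": "first"},
                      "new": {"id": 1, "value": "first-updated"}}),
    (3, "DELETE", 1, {"id": 1, "value": "first-updated"}),
]
target: dict = {}
for _event_id, op, data_id, payload in sorted(events):  # apply in event_id order
    apply_event(target, op, data_id, payload)
print(target)  # {} — the row was inserted, updated, then deleted again
```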

This approach is often used when:

  • You need to capture deletes and full update history
  • You can’t use WAL/logical replication in the current environment
  • You want explicit business-specific payload format

But there are also trade-offs:

  • Trigger logic is custom code and must be maintained alongside schema changes
  • Additional writes increase load on source database
  • If trigger fails, original write transaction fails too
  • Ordering is local to this table (global ordering across many tables is harder)

Log-based

For log-based CDC we need to slightly adjust the Postgres settings:

services:
  database:
    image: postgres:17
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres -d postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 10s
    environment:
      POSTGRES_PASSWORD: postgres
      POSTGRES_USER: postgres
      POSTGRES_DB: postgres
    command:
      - "postgres"
      - "-c"
      - "wal_level=logical"
      - "-c"
      - "max_wal_senders=5"
      - "-c"
      - "max_replication_slots=5"

In this approach we don’t read the data table directly, and we don’t create custom triggers.
Instead, we read changes from the Write-Ahead Log (WAL) using logical replication.

Let’s create the source table, a logical replication slot, and a publication:

CREATE TABLE data (
  id BIGSERIAL PRIMARY KEY,
  value TEXT NOT NULL
);

SELECT pg_create_logical_replication_slot('app_cdc_slot', 'pgoutput');
CREATE PUBLICATION app_cdc_publication FOR TABLE data;

To stream changes out of Postgres we can use the Debezium connector, usually via Kafka Connect. Note that this option requires additional components: a Kafka broker and a Kafka Connect cluster with the Debezium Postgres connector installed.

{
  "name": "postgres-cdc-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "database",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname": "postgres",
    "plugin.name": "pgoutput",
    "publication.name": "app_cdc_publication",
    "slot.name": "app_cdc_slot",
    "table.include.list": "public.data",
    "topic.prefix": "debezium"
  }
}

The connector can be registered with a simple HTTP call:

curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d @connector-config.json

After that, changes in the data table will appear in a topic like debezium.public.data:

INSERT INTO data(value) VALUES ('first');
UPDATE data SET value = 'first-updated' WHERE id = 1;
DELETE FROM data WHERE id = 1;

In this model events include metadata (operation type, transaction/order information, timestamps), so consumers can replay and process changes reliably.
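For illustration, a trimmed Debezium change event for the INSERT above might look roughly like this (exact fields depend on the connector version and configured converters; the lsn, txId, and timestamp values here are invented):

```json
{
  "before": null,
  "after": { "id": 1, "value": "first" },
  "source": {
    "connector": "postgresql",
    "db": "postgres",
    "schema": "public",
    "table": "data",
    "lsn": 24023128,
    "txId": 743
  },
  "op": "c",
  "ts_ms": 1700000000000
}
```

The "op" field encodes the operation ("c" for create/insert, "u" for update, "d" for delete), and the "before"/"after" blocks carry the row state around the change.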

Main advantages of this method:

  • Low overhead on source tables (no polling queries, no custom triggers)
  • Full stream of row-level changes, including INSERT, UPDATE, and DELETE
  • Better ordering/consistency guarantees based on the database log sequence
  • Schema evolution can be propagated automatically by the CDC platform

Important limitations and operational details:

  • Additional infrastructure is required (at least a connector runtime and usually a message broker)
  • Replication slots must be monitored: if a consumer is down for a long time, WAL will accumulate on the source
  • The initial snapshot can be expensive for very large tables
  • Operational complexity is higher than with the select-based and trigger-based approaches

In practice log-based CDC is the most common choice for near real-time integration between operational databases and streaming/analytics systems.

Conclusion

As we can see, there is no “best” CDC approach for all cases. The right choice depends on requirements for latency, consistency, infrastructure, and operational complexity.

Short practical guideline:

  • Select-based is good for very simple and mostly append-only data flows where occasional event loss or collapsed updates are acceptable.
  • Listen & notify is useful for lightweight near real-time notifications inside one Postgres-centric system, but it is limited by payload size and queue specifics.
  • Trigger-based provides explicit control over event format and captures all operation types, but increases write-path load and maintenance effort.
  • Log-based gives the most complete and scalable CDC stream, but requires additional components and operational maturity.

If your task is production-grade data integration between services or analytics platforms, log-based CDC is usually the default choice.
If your task is a small internal integration with minimal infrastructure, trigger-based or even select-based can be enough.