debezium-postgres2lake
debezium-postgres2lake is a Java 21 / Quarkus multi-module CDC solution that processes PostgreSQL change events and persists them into data lake and lakehouse storage formats.
This project provides a complete, extensible foundation for CDC processing from source database events to analytics-ready lake targets.
Purpose
Teams often need CDC pipelines they can inspect, adapt, and evolve. This project provides:
- A structured alternative to ad-hoc CDC scripts
- A modular codebase for format and integration customization
- A practical path from PoC to production-oriented CDC processing architecture
End-to-end flow
- PostgreSQL emits CDC via logical replication
- Embedded Debezium engine captures change events
- Events are normalized into internal records with operation metadata
- Format-specific writers persist records to Avro/Parquet/ORC/Iceberg/Paimon targets
- Optional SPI handlers integrate catalog sync, commit notifications, and readiness markers
Core capabilities
- Multiple output formats: Avro, Parquet, ORC, Iceberg, Paimon
- Operation-aware CDC processing for insert, update, and delete events
- Schema evolution handling for table formats and schema-aware file writing
- Extension APIs and reference implementations for: data catalog integrations, commit event emission, readiness marker emission