Architecture and CDC

CDC uses log-based data capture for changed data records. It propagates the data from the source table using replication remote procedure calls (RPCs) through an internal data-fabric gateway, and produces the data to one or more topics in an HPE Ezmeral Data Fabric Streams destination stream. Once a topic receives the data, external applications can consume the changed data records. The consumer application registers the CDC Deserializer as its record value deserializer and pulls the topic data by using a Kafka API. The data changes can be read from the ChangeDataRecord through the OJAI ChangeData APIs. Consumers can be databases, data archives, search engines, or applications that perform real-time analytics, security, or monitoring.

How are the Change Data Records Propagated?

Propagation is accomplished by setting up a change log that establishes a relationship between the source table and the destination stream. You can set up the change log by using the Control System, maprcli, or REST. Each change log can be paused, resumed, and removed. See Administering Change Data Capture and the maprcli table changelog command for more information.
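
As a minimal sketch of the maprcli route, the following Python helper only builds the argument list for a change log command; it does not run anything. The helper name, the table path, the stream path, and the topic name are hypothetical. The add action and the -path and -changelog flags are assumptions based on the maprcli table changelog syntax, so verify them against the command reference for your release before use.

```python
def changelog_command(action, table_path, stream_path, topic):
    """Build (but do not run) a 'maprcli table changelog' argument list.

    Hypothetical helper. The 'add' action and the -path / -changelog flags
    are assumptions to verify against the maprcli reference.
    """
    allowed = {"add", "pause", "resume", "remove"}
    if action not in allowed:
        raise ValueError(f"unsupported action: {action}")
    return [
        "maprcli", "table", "changelog", action,
        "-path", table_path,                      # source table
        "-changelog", f"{stream_path}:{topic}",   # destination stream:topic
    ]


# Placeholder paths, for illustration only.
cmd = changelog_command("add", "/apps/src_table", "/apps/cdc_stream", "changes")
print(" ".join(cmd))
```

Building the list separately from running it makes it easy to review the command before passing it to a process runner of your choice.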

As data changes on the source table (through CRUD operations), each changed data record is propagated (replicated) to an internal data-fabric gateway. The changed data records are produced to the stream topic in the same order in which they are replicated to the gateway. The data flow is one way: from an HPE Ezmeral Data Fabric Database source table to one or more HPE Ezmeral Data Fabric Streams destination stream topics.
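
The ordering guarantee can be pictured with a toy model. The sketch below is not the product's implementation; it assumes the gateway behaves like a first-in, first-out queue, and the function and variable names are invented for illustration.

```python
from collections import deque


def propagate(changes):
    """Toy model of the one-way flow: source table -> gateway -> topic."""
    gateway = deque()
    topic = []
    for change in changes:           # records arrive in replication order
        gateway.append(change)
    while gateway:                   # records are produced first-in, first-out
        topic.append(gateway.popleft())
    return topic


# Hypothetical CRUD operations against one row.
ops = [("insert", "row1"), ("update", "row1"), ("delete", "row1")]
produced = propagate(ops)
print(produced == ops)  # True: produce order matches replication order
```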

NOTE When an array value is updated, the changed data record contains the full array rather than only the specific data change.
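
To illustrate the note above, here is a small sketch that compares two versions of a document field by field. It is a simplified model, not the actual ChangeDataRecord format: the function name and the sample document are hypothetical, and arrays are compared as whole values, so changing one element reports the entire new array.

```python
def changed_fields(before, after):
    """Return {field: new_value} for every field whose value changed.

    Toy model: an array is treated as one value, so any element change
    reports the whole new array rather than the single element.
    """
    changes = {}
    for field, new_value in after.items():
        if field not in before or before[field] != new_value:
            changes[field] = new_value
    return changes


# Hypothetical document; only the second array element changes.
before = {"name": "sensor-1", "readings": [10, 20, 30]}
after = {"name": "sensor-1", "readings": [10, 25, 30]}
print(changed_fields(before, after))  # {'readings': [10, 25, 30]}
```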

What is the Impact of using Columns/Column Families?

When you propagate a specific column family or column from a binary source table and a row is deleted, the destination stream topic shows only a deletion event for that column family or column. Likewise, when you propagate a specific column from a binary source table and its entire column family is deleted, the destination stream topic shows only a deletion event for that column.

In the scenario where you have a binary source table with fam0, fam1, and fam2 and you set up the change log without columns or column families:

If you delete fam0, fam1, and fam2, the change data events are "delete fam0", "delete fam1", and "delete fam2".
If you delete the row, the change data event will be "delete row".

In the scenario where you have a binary source table with fam0, fam1, and fam2 and you set up the change log with the columns fam1:col1 and fam2:

If you delete fam0, fam1, and fam2, the change data events are "delete fam1:col1" and "delete fam2".
If you delete the row, the change data events are "delete fam1:col1" and "delete fam2".
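
The two scenarios above can be expressed as a small function. This is a sketch of the rules as described in this section, not the product's logic; the function name and the convention of passing "row" for a row deletion are assumptions made for the example.

```python
def delete_events(configured, deleted):
    """Model the delete events that reach the destination topic.

    configured: change log entries such as "fam2" or "fam1:col1";
                an empty list means no columns or column families were set.
    deleted:    the string "row", or a list of deleted column family names.
    """
    if not configured:
        if deleted == "row":
            return ["delete row"]
        return [f"delete {fam}" for fam in deleted]

    events = []
    for entry in configured:
        family = entry.split(":", 1)[0]
        # A configured entry is reported when its family (or the row) is deleted.
        if deleted == "row" or family in deleted:
            events.append(f"delete {entry}")
    return events


families = ["fam0", "fam1", "fam2"]
print(delete_events([], families))
print(delete_events([], "row"))
print(delete_events(["fam1:col1", "fam2"], families))
print(delete_events(["fam1:col1", "fam2"], "row"))
```

With columns configured, a row deletion and a deletion of every column family produce the same events, because only the configured entries are ever reported.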

Where is the Destination Stream Set Up?

The destination HPE Ezmeral Data Fabric Streams stream can be either on the same cluster as the HPE Ezmeral Data Fabric Database source table or on a remote data-fabric cluster. Where and how destination streams are set up depends on the purpose for using CDC.

If you are propagating changed data from a source table on a source cluster to a destination stream topic on a remote destination cluster, you must set up a gateway. To set up a gateway, install it on the destination cluster and specify the gateway node(s) on the source cluster. See Administering Data Fabric Gateways and Configuring Gateways for Table and Stream Replication.

The following diagram shows a simple CDC data model, with one source table and one destination topic on one stream. Because this scenario has the destination stream topic on a remote destination cluster, you must set up and configure a gateway.

NOTE More complex CDC scenarios can be implemented, and multiple gateways can be set up.

IMPORTANT If you have a secure cluster, you must set up a secure configuration. See Configuring Secure Clusters for Cross-Cluster Mirroring and Replication.