While some might observe that the difference between Slowly Changing Dimensions (SCD) And Change Data Capture (CDC) is subtle, there is, in fact, a technical difference between the two processes.
Both processes detect changes in a source database and deliver the changed data to a target database. The difference between the two is almost entirely about what happens in the target database to the data.
What are Slowly Changing Dimensions (SCD)?
There are actually six types of SCD with the most common being Type 1, Type 2 and Type 3. SCD types 4, 5, and 6 are inefficient and overly complicated for maintaining a history of all changes or overwriting old data, which are the two essential purposes of Slowly Changing Dimensions.
In Type 1, any new data that is ingested overwrites existing data. In Type 2, new data are inserted as new records and the data that would have been overwritten are flagged as inactive or closed with effective time and expiration time assigned to the change to maintain a history. In Type 3, one column is designated for storing previous data (i.e. the data that would’ve been overwritten in Type 1).
In short, Type 1 stores no historical data, Type 2 stores all historical data, and Type 3 stores limited historical data.
What is change data capture?
CDC is a method of detecting and extracting new or updated records in a source and loading just this new information into your destination. Very often, the alternative to CDC is a full load from one table to another, resulting in a very costly and time-consuming operation. By sipping into your target database just the delta or changed data, you get a much more streamlined process.
There are actually three different ways of performing CDC: log-based, query-based, and trigger based. Differences that are explored elsewhere in our blogs in detail. Essentially, however, log-based CDC updates a log for every INSERT, UPDATE or DELETE and reads that information when it is time to insert into the target database, while trigger CDC kicks off a trigger for every operation with the same result. Log-based CDC is considered to be more efficient than a trigger CDC method. Query-based CDC involves using queries to find differences between datasets and can be untenable with larger datasets as it can require much more resources to perform this comparison.
CDC looks the most like Type 1 Slowly Changing Dimensions, overwriting new data as it appears. It is most useful when you’re not worried about maintaining a history of all the changes to your database. Like most other modern data integration systems, StreamSets supports log-based CDC.
Choosing Slowly Changing Dimensions vs. Change Data Capture
The choice between SCD and CDC is generally decided based on what the tool you’re using supports. But that’s not the way it should be.
Choosing SCD over CDC or vice versa should be based on business processes, not technical limitations. SCD is ideal for organizations that must maintain a record of all changes to the data flowing through their systems. And CDC is ideal if your business process requires only that the changed data arrive at your target.