To load data from OLTP system to DW, we have to face a problem: how to balance time and cost. Since data raises faster and faster, we need to increase our hardware ability to match the time requirement. So, incremental load coming out to reduce the data transmission significantly. There are three way to achieve it: 1. use datetime column 2. Changed data capture 3. changed data tracking. For a long time, I use the first way to capture the changed data manual, it is good, but too many works in development and testing. Here I want to introduce the CDC. Basically, CDC just a feature to utilize LSN(Log Sequence Number) and log tables to capture changed data, while SSIS itself provide native components to easy work with this new feature.
Let’s build a CDC workflow in SSIS for example:
Enable CDC feature
- enable CDC by executing
sp_cdc_enable_db
(disable bysp_cdc_disable_db
). Then check it withselect name from sys.databaseswhere is_cdc_enabled=1
- enable CDC for spec table with
sp_cdc_enable_table
then we can find the CDC table in systemtable folder or usingselect namefrom sys.tables tabwhere is_tracked_by_cdc=1
to check.
exec sys.sp_cdc_enable_table
@source_schema = N'Person'
, @source_name = N'Address'
, @role_name = N'cdc_Admin'
, @capture_column_list = N'column1, column2'; //can track spec columns, rather than the whole table
Control flow setting
- add
CDC control task
- set CDC control operation to
mark cdc start
and set the cdc states for saving cdc states - run this control task, it will create a record in table
cdc_states
- create two
CDC control tasks
, one set operation toGet Processing Range
, another forMark process range
, they will get changed data and update CDC states respectively. - put a dataflow which is response for ETL operation, between two CDC control tasks.
Data flow setting into staging table
- add
CDC source
which points to the table enabled CDC and choose the correct cdc_states table as well. - choose
Net
CDC processing mode in CDC source. - add
CDC splinter
after CDC source, create threeDerived Column transformation
for insert(0), update(2) and delete(1) data. - create a
Union All transformation
to union all data and export to stage database. - if necessary, we need to add a
truncate
script before all control flow to delete everything in stage database.
Update fact table through staging tables
- create a oledb source to connect to stage database
- use conditional split to split
insert
andupdate+delete
- for
insert
, we directly export; forupdate+delete
we need to delete from fact table by identifier byOLE DB Command transformation
, and use conditional split to export update data. - if necessary, use lookup to replace some dimensional columns
- export to fact database.
