When a dimension is small and changes infrequently, we can simply truncate the table and reload it in full each time we want to update it. This is not good practice for large or slowly changing dimensions, or for fact tables, because a full reload consumes too much time and too many resources. Instead, we should load only the records that have been inserted or modified since the last load. One way to do this is to add two columns to each source table: the first records when the record was created or last modified, and the second records when the record was last fetched by the load process. However, this approach adds overhead to the transactional system, which must record the creation or modification datetime for every record; that in turn requires modifying the front-end systems that create or modify those records.
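As a minimal sketch of this timestamp-column approach (the table and column names here are hypothetical), the source table would carry the two extra columns and the incremental load would select only the rows changed since they were last fetched:

```sql
-- Hypothetical source table with the two extra tracking columns
ALTER TABLE dbo.Orders ADD
    ModifiedDate    datetime NOT NULL DEFAULT GETDATE(), -- set by the front-end on insert/update
    LastFetchedDate datetime NULL;                       -- set by the load process after each fetch

-- Incremental load: only records created or modified since the last fetch
SELECT *
FROM dbo.Orders
WHERE LastFetchedDate IS NULL
   OR ModifiedDate > LastFetchedDate;
```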
Change Data Capture (CDC) avoids this overhead because SQL Server reads the changes from its own transaction log, so the front-end systems do not need to change. To implement it, do the following:
- Enable Change Data Capture for the whole database by executing the following command:

USE theDatabase
GO
EXEC sys.sp_cdc_enable_db
where "theDatabase" is the name of the database that contains the tables you want to track.
- Enable Change Data Capture for each table you want to track by running the following script for each table:
EXECUTE sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name = N'tableX',
    @role_name = N'cdc_Admin',
    @supports_net_changes = 1,
    @captured_column_list = N'column1, column2, column4'
where:
- @source_name is the name of the tracked table.
- @role_name is the database role that will be used to establish database rights for the change data capture process. This role will be created if it does not already exist in the database.
- @supports_net_changes: when set to 1, change data capture for this particular table supports the net changes feature. In other words, a record may be modified more than once after the last load; setting this parameter to 1 provides a function that returns the final status (the net changes) of each record over a given time period.
- @captured_column_list is the list of columns you want to capture changes for. You can omit this parameter to capture modifications for all columns. Note that each tracked column has a mirror column in the change table.
- In order to query the change table for a given period, we have to create wrappers for the "all changes" query functions. The script required to create these functions is the following:
CREATE TABLE #WrapperSource (FunctionName sysname, SourceCode nvarchar(max))

INSERT INTO #WrapperSource
EXEC sys.sp_cdc_generate_wrapper_function

DECLARE @CurrFunctionName sysname
DECLARE @SourceCode nvarchar(max)

SELECT @CurrFunctionName = MIN(FunctionName)
FROM #WrapperSource

WHILE @CurrFunctionName IS NOT NULL
BEGIN
    SELECT @SourceCode = SourceCode
    FROM #WrapperSource
    WHERE FunctionName = @CurrFunctionName

    EXEC sp_executesql @SourceCode

    SELECT @CurrFunctionName = MIN(FunctionName)
    FROM #WrapperSource
    WHERE FunctionName > @CurrFunctionName
END

DROP TABLE #WrapperSource

The following is an example of how to use one of the generated functions:

SELECT * FROM dbo.fn_all_changes_dbo_tableName('11/11/2011', '12/11/2011', 'all')

- Create a table to save the last time the data was loaded. Name it "lastTimeDataLoaded".
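A minimal sketch of that table, assuming a single lastLoadDate column (the column name matches the query used later in the package):

```sql
-- Watermark table: one row holding the date up to which data has been loaded
CREATE TABLE lastTimeDataLoaded (lastLoadDate datetime NOT NULL);

-- Seed it once with the date of the initial full load
INSERT INTO lastTimeDataLoaded (lastLoadDate) VALUES (GETDATE());
```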
- Create a new SSIS package and create the following variables:
  - startDate as DateTime
  - endDate as DateTime
  - dataLoadQry as String; set the EvaluateAsExpression property of this variable to true and set the Expression property to:

"SELECT column1, column2, column4, __CDC_OPERATION AS Operation
FROM dbo.fn_all_changes_dbo_Orders('" + @[User::startDate] + "', '" + @[User::endDate] + "', 'all')
ORDER BY __CDC_STARTLSN, __CDC_SEQVAL"

- Add an Execute SQL task to the package, then insert a SQL script that queries the "lastTimeDataLoaded" table and sets the startDate and endDate variables. The query is something like:

SELECT MAX(lastLoadDate) AS startDate, DATEADD(d, 1, MAX(lastLoadDate)) AS endDate
FROM lastTimeDataLoaded

assuming the package runs once every day at the same time. In the Result Set tab of the Execute SQL Task Editor, map the result columns to the startDate and endDate variables.
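The query above assumes the table already holds a row. A hedged variant for the very first run could fall back to a default start date (the fallback date here is an assumption; it should be no earlier than the date CDC was enabled):

```sql
-- If the watermark table is empty, fall back to a default start date
SELECT ISNULL(MAX(lastLoadDate), '20110101') AS startDate,
       DATEADD(d, 1, ISNULL(MAX(lastLoadDate), '20110101')) AS endDate
FROM lastTimeDataLoaded;
```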
- Add a Data Flow task and connect the Execute SQL task to the newly added Data Flow task. Inside the data flow task there is the following:
  - a data flow source that gets the data by executing dataLoadQry, which is connected to:
  - a Conditional Split transformation that splits the returned rows among three data flow destinations: one for inserting new records, another for updating existing records, and the last one for deleting records. The split is based on the Operation column returned by the query: "I" means insert, "U" means update, and "D" means delete.
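To make the three branches concrete, here is a sketch of the statements the three destinations effectively perform against a hypothetical warehouse table dw.Orders (the choice of column1 as the business key is an assumption):

```sql
-- "I" rows: insert new records into the warehouse table
INSERT INTO dw.Orders (column1, column2, column4)
VALUES (@column1, @column2, @column4);

-- "U" rows: update existing records (column1 assumed to be the business key)
UPDATE dw.Orders
SET column2 = @column2, column4 = @column4
WHERE column1 = @column1;

-- "D" rows: delete records removed from the source
DELETE FROM dw.Orders
WHERE column1 = @column1;
```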
- Finally, go back to the Control Flow tab and add another Execute SQL task. This task should contain a SQL UPDATE statement that updates the lastTimeDataLoaded table.
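A minimal sketch of that update, assuming the single-row lastTimeDataLoaded table and the one-day window used earlier:

```sql
-- Advance the watermark by the one-day window that was just loaded
UPDATE lastTimeDataLoaded
SET lastLoadDate = DATEADD(d, 1, lastLoadDate);
```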
Reference: Delivering Business Intelligence with Microsoft SQL Server 2008 by Brian Larson.