Sunday, November 20, 2011

Loading big and slowly changing dimensions and fact tables using SSIS


For small dimensions that are not slowly changing, we can simply truncate the table and reload it in full each time we want to refresh it. That is not a good practice for big, slowly changing dimensions or for fact tables, because it consumes a lot of time and resources; instead we should load only the records that have been inserted or modified since the last load. One way to do so is to add two extra columns: the first records when the row was created or last modified, and the second records when the row was last fetched. The drawback of this approach is the overhead it puts on the transactional system, which now has to record a creation or modification datetime for every record, and that in turn usually requires modifying the front-end systems that create or modify those records.
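To make that concrete, here is a minimal sketch of such an incremental extract. The ModifiedDate column and the way the last load time is obtained are assumptions for illustration only, not something the CDC approach below requires:
    -- Hypothetical incremental extract: pull only rows created or changed since the last load.
    -- ModifiedDate and @lastLoadDate are assumed names.
    DECLARE @lastLoadDate datetime
    SET @lastLoadDate = '2011-11-19'   -- in practice, read this from wherever you store it

    SELECT column1, column2, column4
    FROM dbo.tableX
    WHERE ModifiedDate > @lastLoadDate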
If your source database is SQL Server 2008, you can use a new feature called Change Data Capture (CDC), which reads the SQL transaction log to capture data modifications and then writes them to change tables. A change table is created for each table tracked by Change Data Capture, and it is named cdc.schemaname_tablename_CT.
To implement this you have to do the following:
  1. Enable Change Data Capture for the whole database by executing the following command:
    USE theDatabase
    EXEC sys.sp_cdc_enable_db

    where "theDatabase" is the name of the database that contains the tables you want to track.
  2. Enable Change Data Capture for each table you want to track by running the following script for each table:
    EXECUTE sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name = N'tableX',
        @role_name = N'cdc_Admin',
        @supports_net_changes = 1,
        @captured_column_list = N'column1, column2, column4'

    where:
    - @source_name is the name of the tracked table.
    - @role_name is the database role used to establish database rights for the change data capture process; this role is created if it does not already exist in the database.
    - @supports_net_changes, when set to 1, enables the net changes feature for this table. A record may be modified more than once since the last load; setting this parameter to 1 also generates a function that returns only the final status (the net change) of each record over a given time period.
    - @captured_column_list is the list of columns you want to capture changes for. You can omit this parameter if you want to capture modifications for all columns. Note that for each tracked column there is a mirror column in the change table.
  3. In order to be able to query the change tables for a given period, we have to create wrappers for the "all changes" query functions. The script required to create these wrapper functions is the following:
     -- Temp table to hold the wrapper function names and their CREATE scripts
     CREATE TABLE #WrapperSource (FunctionName sysname, SourceCode nvarchar(max))

     -- sp_cdc_generate_wrapper_function returns one row per wrapper function
     INSERT INTO #WrapperSource
     EXEC sys.sp_cdc_generate_wrapper_function

     DECLARE @CurrFunctionName sysname
     DECLARE @SourceCode nvarchar(max)

     -- Loop over the generated scripts and execute each one to create the functions
     SELECT @CurrFunctionName = MIN(FunctionName)
     FROM #WrapperSource

     WHILE @CurrFunctionName IS NOT NULL
     BEGIN
         SELECT @SourceCode = SourceCode
         FROM #WrapperSource
         WHERE FunctionName = @CurrFunctionName

         EXEC sp_executesql @SourceCode

         -- Move on to the next function name
         SELECT @CurrFunctionName = MIN(FunctionName)
         FROM #WrapperSource
         WHERE FunctionName > @CurrFunctionName
     END

     DROP TABLE #WrapperSource

    and the following is an example of how to use it:
     SELECT * FROM dbo.fn_all_changes_dbo_tableX('11/11/2011', '12/11/2011', 'all')
  4. Create a table to save the last time the data was loaded; name it "lastTimeDataLoaded".
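     A minimal sketch of such a table, assuming a single lastLoadDate column (the column name the query in step 6 expects) and an initial seed row so the first run has a starting point:
     CREATE TABLE lastTimeDataLoaded (lastLoadDate datetime NOT NULL)

     -- Seed value is an assumption: pick a date just before your first load.
     INSERT INTO lastTimeDataLoaded (lastLoadDate) VALUES ('2011-11-01')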
  5. Create a new SSIS package and define the following variables:
     - startDate as datetime
     - endDate as datetime
     - dataLoadQry as string; set the EvaluateAsExpression property of this variable to true and set the Expression property to the following (note the casts, since startDate and endDate are datetime variables):
    "SELECT column1, column2, column4, __CDC_OPERATION AS Operation
    FROM dbo.fn_all_changes_dbo_Orders('" + @[User::startDate]+ "', '" + @[User::endDate] + "', 'all')
    ORDER BY __CDC_STARTLSN, __CDC_SEQVAL"
  6. Add an Execute SQL Task to the package, then give it a SQL statement that queries the "lastTimeDataLoaded" table and sets the startDate and endDate variables. The query is something like:
        SELECT MAX(lastLoadDate) AS startDate, DATEADD(d, 1, MAX(lastLoadDate)) AS endDate FROM lastTimeDataLoaded
        (assuming the package runs once every day at the same time)
    In the Result Set tab of the Execute SQL Task Editor, map the result columns to the startDate and endDate variables.
  7. Add a Data Flow Task and connect the Execute SQL Task to it. Inside the data flow task there is the following:
     - a data flow source that gets the data by executing dataLoadQry, which is connected to:
     - a Conditional Split transformation that splits the returned data to three data flow destinations: one for inserting new records, another for updating existing records, and the last one for deleting records. The splitting is done based on the Operation column returned by the query: "I" means insert, "U" means update, and "D" means delete.
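     As an illustration only, the split conditions and the statements behind the update and delete branches might look like the following sketch; dimTableX and businessKey are assumed names that depend on your dimension design:
     -- Conditional Split expressions (SSIS expression language), one per output:
     --   Inserts: Operation == "I"
     --   Updates: Operation == "U"
     --   Deletes: Operation == "D"

     -- The update and delete outputs can feed OLE DB Command transformations with
     -- parameterized statements such as:
     UPDATE dimTableX SET column1 = ?, column2 = ?, column4 = ? WHERE businessKey = ?
     DELETE FROM dimTableX WHERE businessKey = ?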
  8. Finally, go back to the Control Flow tab and add another Execute SQL Task. This task should contain a SQL statement that updates the lastTimeDataLoaded table with the new load date.
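     A minimal sketch of that statement, assuming the ? parameter is mapped to the endDate variable in the Parameter Mapping tab of the Execute SQL Task:
     -- Record the end of the window that was just loaded, so the next run starts from here.
     UPDATE lastTimeDataLoaded SET lastLoadDate = ?
     -- Alternatively, insert a new row per load if you prefer to keep a history:
     -- INSERT INTO lastTimeDataLoaded (lastLoadDate) VALUES (?)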
Reference: Delivering Business Intelligence with Microsoft SQL Server 2008 by Brian Larson.
