|
Introduction
Having discussed type1 slowly changing dimensions (SCD) in Part 1 of this article, this article will talk about further on SCDs and how we can implement them in SQL Server 2005.
What is Type2 SCD
A Type 2 SCD writes a record with the new attribute information and preserves a record of the old dimensional data. Implementing type2 changes within a data warehouse environment might require significant analysis and development time. These changes accurately partition history across time more efficiently than other types. However, they also add new records to the data warehouse environment thus, significantly adding to the database size.
Let's take a look at an example:
Staging
Customer Code | Customer Name | Region |
53724 | Melody | Raleigh |
6705 | Dan | Lubbock |
Dimension
CustomerSK | Customer Code | CustomerName | Region |
1 | 53724 | Melody | Raleigh |
2 | 6705 | Dan | Luke |
CustomerSK | SalesAmount | Date |
1 | 1500 | 01/01/2008 |
2 | 2500 | 01/01/2008 |
In the above example, the dimension table is created from the staging table. In the dimension table the surrogate key (CustomerSK) is generated. In the fact table, the surrogate key from the customer dimensions are taken.
Let us assume that, after the 1st of January 0f 2008, the Region for the customer Dan is changed to Fresno from Lubbock. If you can remember in the Type 1 SCD, the existing record will be updated to Fresno in the dimension table and after that there is no way that you can get the previous information.
However, in the business perspective, if the dimension record is updated the previous data will be read as a new region, which may lead to the wrong business or management decision.
Let's see how this problem can be overcome with the use of Type 2 SCDs. Below is example for Type 2 SCDs.
Staging
Customer Code | Customer Name | Region |
53724 | Melody | Raleigh |
6705 | Dan | Lubbock |
Dimension
CustomerSK | Customer Code | CustomerName | Region |
1 | 53724 | Melody | Raleigh |
2 | 6705 | Dan | Luke |
3 | 6705 | Dan | Fresno |
CustomerSK | SalesAmount | Date |
1 | 1500 | 01/01/2008 |
2 | 2500 | 01/01/2008 |
In the above example, you can see that a new record is inserted for the change whilst the previous record is left unchanged. For the latest fact record a new customer surrogate key will be used. Because of this, the historical data for region will be maintained and new data will be read as new region.
In this type of design, you have to include a new column to identify which is the active or current dimension record.
Most of the time, you can include a flag saying whether it is the current or active record. However, the most correct way of doing this is, is to include the start date column and end date column. In the case of the current record the end date column will be null. In the case when you need to reload the entire fact data, keeping the date columns will be helpful. If you have a flag without a date field, you will not be able to find out the fact data that belongs to a customer dimension.
Implementation of Type 2 with SSIS
Let's look at the same example which we used for the Type 1 SCDs in the previous article. Like we did before we will create a SSIS project and create new SSIS package.
We will then add a Data flow task to the control flow and double click the data flow task. We will then drag and drop an OLE Database Source and the following is what should enter for the SQL Command.
SELECT cus.CustomerID, cus.FirstName, cus.LastName, cus.MaterialStatus, reg.RegionName, cat.CategoryName
FROM staging.tblCategory AS cat INNER JOIN staging.tblcustomer AS cus
ON cat.ID = cus.CategoryID
INNER JOIN staging.tblRegion AS reg
ON cus.RegionID = reg.ID
Next, drag and drop a Slowly Changing Dimension control and let's configure it.
As in the Type1 SCD, the first screen will ask you to map the source and target columns and select the business key which is CustomerID.
The next step is to manage the changes to column data in the slowly changing dimensions by settings the change type for dimension columns.
From the above image, you can see that the CategaryName, FirstName, LastName and MaritalStatus are configured as changing attributes which means that if there is a change in these columns a new record will not be updated. Instead, the existing record will be updates. However, for the RegionName the change type is set to Historical attribute which means that historical data will be kept for RegionName.
The next option is to set whether to change all the matching records, including outdated record when changes are detected. This option is useful to update things like names. If a customers name has changed it could be because ofa typing error. Therefore it is better to change all og the historical records as well.
Next is to configure the historical attribute option. As we discussed before, we need to identify what is the current record of the customer. For example, if customer has changed from Region A to Region B, we will have two records. When loading the fact table we need to assign current customers. For that we need to know what the current customer record is. As discussed before, there are two ways of doing this.
1. Assign attribute (True/False or Current/Expired) to a dimension field.
In the above configuration, you need to first assign the column to indicate the current record. In this example it is IsCurrentRecord which has a bit data type. In the above configuration, True is configured to the current value and false is configured to the Expiration value.
There is another default configuration which is Current/Expired. Unlike True/False which can be assigned to Bit/Int/Varchar data types, Current/Expired can be allocated only to a varchar.
You can have your own configuration for this field of your choice. My favourite values are Active/Historical.
2. Assign start date and end date for the dimension record. For the current record the end date is null.
In this configuration, you can assign the start date and end date column. Both columns should have datetime or smalldatetime data types. In this configuration, you have to configure value to set data values. In this you will be given a list of available date time variables in the package. Out of these there are two system variables you can assign. System::StartTime is the start time of the package and System::ContainerStartTime is the start time of the container.
Ok, that’s all you have to do for the Type 2 SCD and the following is what you get:
Though you don’t need to know what these components do, it will be better to know how it works.
Changing Attribute Updates Output will update changing attributes like the firstname columns as in this example. The Historical Attribute Inserts Output path introduces a new variable depending on the selection you made in historical attribute option. If you selected to use date time value it will be @[System::ContainerStartTime] or if select the option historical it will be value you need to update the column will be derived to a new column and it will be updated to say that record is expired.
After the update Historical Attribute Inserts Output and New Inserts paths are unioned. Then derive a column is used to derive the current and lastly it will insert records in to the table.
Inferred Member Updates
Inferred members are also another important fact when it comes to dimensions. Sometimes, though the dimension record is inserted, it is not the final user. For example, if a broker buys house for another person, obviously he is not the final user. Thus it will give you wrong information if you filled a dimension record with broker’s data. However, you need to insert some dimension record which will be updated when the broker sells the house to the final user.
Implementation of Inferred Member Updates in SSIS
In the above SCD configuration, there is another screen to configure inferred member updates.
After enabling the inferred member support, you need to select either of the options available here. If you select the first option, it will treat all columns with a change type with null as inferred member. The second option, you can include new Boolean column to indicate whether the current record is an inferred member. Second option is the most correct way of doing it.
After configuring this the following is the layout of the SCD.
After configuring the inferred member updates, there is an additional path for Inferred Member Updates Output which will have OLEDB Command.
Other Types of SCD
There are several other SCD types which are hardly used. For the completeness and academic purposes, you can read them in Wikipedia. Because of the fact that they are hardly used, the SSIS SCD control does not support these SCD types. If you really have a requirement to implement these SCD types, you have to use other data flow tasks.
Sample Packages
We have discussed Type2 SCD in this article and hope this will give you an introduction to the SCD including the previous article. If you need the sample package and the sample table schema scripts, drop me an email to dineshasanka@dbfriend.net.
No comments:
Post a Comment