|
Introduction
Slowly changing dimensions (SCD) are very useful in data warehouses. Even though they are useful, it is very difficult to implement slowly changing dimensions with most of ETL tools. In case of SQL Server 2000 it was difficult to implement SCD's as there was no direct way of implimenting them. As you are aware there were numerous changes in SQL Server 2005 including the introduction of a SCD data flow task. This article explores the options available in the SQL Server 2005 for SCD's.
Requirements
You need to have installed SQL Server 2005 with Business Intelligence Development Studio (BIDS). It will also be beneficial if you know how to create simple SQL Server Integration Services (SSIS) packages. However, as usual I will elaborate on how to develop them when and where ever possible.
What is SCD?
The "Slowly Changing Dimension" problem is a common one particular to data warehousing. In a nutshell, this applies to cases where the attribute for a record varies over time. Let us take this hypothetical example. Let us say that we have a fact table that contains customersk. The Fact table is linked to dimcustomer and there is the following relationship between these two tables.
This dimcustomer table data is taken from the OLTP database which has the following table structures and relationships.
Type 1 SCD
A Type 1 SCD simply overwrites the original information. In other words, no history is kept. For example, if one customer’s region is changed from one to another, the previous region is overwritten by the new one and the previous data will be lost.
This is the easiest way to handle the SCD problem, since there is no need to keep track of the old information. An obvious disadvantage of this type of SCD is that you are unable to analyse historical data. For example, all the previous sales for a customer will be analysed in the new region and not in the old region. Due to the simplicity of implementing a Type 1 SCD, it is revealed that more than 60% of dimensions fall into the Type 1 category.
Implementation of Type1 SCD
Now let us see how we can implement a Type1 SCD with SQL Server 2005.
First create a SSIS Project by launching BIDS. Then add a new SSIS package to the project. The next step is to create a new connection manager pointing to the relevant database. Then add a data flow task to the control flow and double click the added data flow.
Add an OLEDB source editor and select SQL command as the data access mode and enter the following T-SQL command.
SELECT cus.CustomerID,
cus.FirstName,
cus.LastName,
cus.MaterialStatus,
reg.RegionName,
cat.CategoryName
FROM staging.tblCategory AS cat
INNER JOIN staging.tblcustomer AS cus ON cat.ID = cus.CategoryID
INNER JOIN staging.tblRegion AS reg ON cus.RegionID = reg.ID
The connection manager should look like this:
Next, drag and drop the Slowly Changing Dimension to the data flow and connect it to the output of the oledb source. After dropping it, you will be taken to the SCD wizard. If not, double click the control so that you will be taken to the wizard.
The first screen that you will get is to select the mapping between the source (staging) and the dimension (data warehouse) table.
Next you need to select the dimension table. In this case it is datawarehouse.dimcustomer.
After that our next task is to map the input and dimensions columns. By default, columns with similar names and data types will be mapped. Out of the existing columns you need to select at least one as the Business key. The Business key is the key which should be used to lookup the tables to verify whether it needs an insert or an update.
You will not be able to map two columns if they have different data types. If you don’t have the same data types you will have to convert them to the data type of the dimension table by using a data conversion data flow transformation.
The next step is to allocate change type to the columns.
In the case of a Type 1 SCD, there are two types of change types available. They are fixed attribute and changing attribute. Fixed attributes are the attributes which do not changes. For example, the first name and last name will not change if not there is typing errors. Changing attributes are the attributes which will overwrite with the existing values.
Next will be following screen.
The fixed attributes option is to set fail in case errors are detected in a fixed attribute. If you set this option to true, the entire SCD transformation will fail.
The changing attributes option is used in Type 2 SCD's hence we will discuss about this in next article.
After doing all this and clicking the finish button you will see the final package that looks like:
You can see that two paths are created, one for the insert (New Output path) and the other for the update (Changing Attribute Updates Output path). Apart from these default paths we can use the Unchanged output path if we need to monitor unchanged records.
Not only can you add new paths, you can also change the existing path. For example, in case you need to add new columns, you can introduce Derive Column component.
The problem with the SCD wizard is after configuring everything, if you redo the wizard again, all your modifications will be removed and you will need to add them again.
Let us look at these data flows tasks which are created by the SCD wizard.
First we will look at the Insert Destination control.
If you double click the control you will be taken to the following screen.
You can see that the insert data flow task is a standard OLE DB Destination.
If you double click the OLE DB command data flow task, you will see that it is an OLE DB command data flow task with the following update statement:
UPDATE [datawarehouse].[DimCustomer] SET [CategoryName] = ?,[MaterialStatus] = ?,[RegionName] = ? WHERE [CustomerID] = ?
Performance
In the case of a large dimension, it will take more resources for SCD's. Hence you need to ensure that a SCD is the correct solution for your problem. I have seen many places where SCD's have been used but there are only inserts performed. In this situation, rather than using a SCD, you can use a Lookup data flow task and include the OLE DB Destination at the failure path of the lookup.
Another important improvement you can do is by introducing an Index to the business key. The business key is used to identify whether the record needs to be updated or to be inserted. Also, in the case of an update it uses the WHERE clause to update the table.
Issues
I have come across three issues with the SCD data flow task in SSIS.
The first issue I have already specified. That is when you modify the SCD configuration, after performing the modification you will lose the changes you have done to the output paths of the SCD. However, there is small chance of doing modification to your mappings. You can do this by right clicking the SCD data flow task and selecting the Show Advanced Editor option. However, this change will not reflect in the insert and update path. Hence you have to do those changes manually.
Another requirement is for missing data. For example, if a customer is deleted from the staging (Meaning, we have a customer record in the data warehouse area but not in staging area), there is no way of tracking it from the SCD. Theoretically, we are not deleting data warehouse data, but at least we may need this information to update a flag saying that specific customer is deleted.
Another issue is the fixed attribute change type. If you select the option failed the transformation if changes are detected in a fixed attributes, the entire package will fail. Instead there should be a way to get those changes so that they can be logged to different table for further analysing and not affecting the other records.
What's Next
We have discussed Type1 SCD in this article and hopefully this has given you an introduction to the SCD. If you need the sample package and the sample table schema and data queries, you can email me at dineshasanka@dbfriend.net.
In the next article I will introduce Type 2 SCD's and inferred data updates.
No comments:
Post a Comment