Data cleansing (or ‘data scrubbing’) is the process of detecting and then correcting or removing corrupt or inaccurate records from a record set.
After cleansing, a data set will be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been caused by different data dictionary definitions of similar entities in different stores, by user entry errors, or by data that was corrupted in transmission or storage. Preprocessing the data also helps ensure that it is unambiguous, correct, and complete.
The actual process of data cleansing may involve removing typos or validating and correcting values against a known list of entities. The validation may be strict, such as rejecting any address that does not have a valid ZIP code, or fuzzy, such as correcting records that partially match existing, known records. Data cleansing differs from data validation in that validation almost invariably means data is rejected from the system at entry time, whereas cleansing is performed on batches of data that are already in the system.
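As a simple illustration of the strict case, a check like the following would flag addresses whose ZIP code is not exactly five digits. This is only a sketch; stgAddress and its columns are hypothetical names invented for the example.

-- Strict validation sketch: flag rows whose ZIP code is not exactly five digits.
-- stgAddress and its columns are hypothetical names used for illustration only.
SELECT AddressLine, ZipCode
FROM stgAddress
WHERE ZipCode NOT LIKE '[0-9][0-9][0-9][0-9][0-9]';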
Example
Let’s examine a real-world scenario. In the case of a data warehouse, let us assume that we are going to load customers’ invoices into a database. We will have the following data which we need to load into our data warehouse database.
Date, Product Code, Product Description, Customer Name, Quantity and Unit Price are the data that we are receiving from the data source. Let us assume that we have them in a text file. In the data warehouse database, we will have four tables, namely dimCustomer, dimDate, dimProduct and fctSales, and they will have the following relationships.
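As a rough sketch, the tables could be defined along these lines; the column names and data types here are only assumptions, not the exact definitions used in the package.

-- Rough sketch of the star schema; column names and data types are assumptions.
CREATE TABLE dimDate     (DateSK     INT IDENTITY PRIMARY KEY, FullDate     DATE);
CREATE TABLE dimProduct  (ProductSK  INT IDENTITY PRIMARY KEY, ProductCode  VARCHAR(20), ProductName VARCHAR(100));
CREATE TABLE dimCustomer (CustomerSK INT IDENTITY PRIMARY KEY, CustomerName VARCHAR(100));

CREATE TABLE fctSales (
    DateSK     INT REFERENCES dimDate(DateSK),
    ProductSK  INT REFERENCES dimProduct(ProductSK),
    CustomerSK INT REFERENCES dimCustomer(CustomerSK),
    Quantity   INT,
    UnitPrice  DECIMAL(10, 2)
);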
So we have the task of updating these tables with the data we are receiving from the text file. The customerSK needs to be identified by performing a lookup on the dimCustomer table using the customer name, while the dateSK needs to be identified by doing a lookup against dimDate with the date field. Similarly, the productSK needs to be identified from dimProduct, and we have both the product code and the product name to find it.
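In plain T-SQL terms, the lookups the package will perform are roughly equivalent to the joins below. This is only a sketch; stgInvoice is an assumed name for the staged text file data.

-- Rough T-SQL equivalent of the surrogate key lookups done in the data flow.
-- stgInvoice is an assumed name for the staged text file data.
INSERT INTO fctSales (DateSK, ProductSK, CustomerSK, Quantity, UnitPrice)
SELECT d.DateSK, p.ProductSK, c.CustomerSK, s.Quantity, s.UnitPrice
FROM stgInvoice AS s
JOIN dimDate     AS d ON d.FullDate     = s.InvoiceDate
JOIN dimProduct  AS p ON p.ProductCode  = s.ProductCode
JOIN dimCustomer AS c ON c.CustomerName = s.CustomerName;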
We will start the task by creating a SQL Server Integration Services (SSIS) project and then adding a data flow task to the control flow. In the data flow task, we will start by adding a flat file source and pointing it at the CSV file.
Using Lookup
The next step is to identify the relevant productSK for each record. We have two parameters for finding the correct productSK - the product code and the product name. We can first use the product code to look up the dimProduct table, and if that fails we can use the product name to get the productSK.
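A rough T-SQL sketch of that fallback logic is shown below; the package implements it as two chained Lookup transformations rather than as a single query, and stgInvoice is again an assumed staging table name.

-- Sketch of the two-step lookup: match on product code first,
-- then fall back to product name when the code finds nothing.
SELECT s.ProductCode,
       s.ProductDescription,
       COALESCE(byCode.ProductSK, byName.ProductSK) AS ProductSK
FROM stgInvoice AS s
LEFT JOIN dimProduct AS byCode ON byCode.ProductCode = s.ProductCode
LEFT JOIN dimProduct AS byName ON byName.ProductName = s.ProductDescription;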
The above package diagram shows the method of identifying the productSK from the dimProduct table. It starts with 12 records. When they pass through the first lookup (Lookup by Product Code), 9 records match a record in the product table. The image below is the lookup property screen for the lookup by product code.
ProductCode is used as the join column to get the productSK. In the reference table tab, the reference table is dimProduct. You can use a table or a view as the reference; you can even use a query. For optimal performance it is best to have an index on the column on which you are going to perform the lookup.
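For example, an index along these lines (the index name is only illustrative) would cover the lookup on product code:

-- Index the lookup column so the reference query performs well.
CREATE NONCLUSTERED INDEX IX_dimProduct_ProductCode
    ON dimProduct (ProductCode)
    INCLUDE (ProductSK);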
From the above lookup there are three rows which could not find matching records in the dimProduct table by product code. Those three records are passed to another lookup. This time the product name is used as the lookup column. This lookup is implemented in the same manner as the product code lookup.
After this lookup you can see that another two records were found. These two records and the previous nine records are merged into one stream by using a Union data flow task. However, there is one remaining record which matches neither on its product code nor on its product name. This record is logged to another table, with a new column introduced by the No Product Code derived column control. The purpose of this is to allow the record to be identified or corrected manually and reloaded into the system at a later time.
To reduce the number of error records, use as many parameters as possible. In this case, to find the productSK, we have used two parameters - the product code and the product name.
Fuzzy Lookup
After finding the productSK, the next task is to find the customerSK. The diagram below shows the implementation of the method for finding the customerSK, which will be explained in detail.
We have one parameter to identify the customerSK, which is the customer name. As in the previous case, we start the process with a simple lookup. You can see that out of 11 records, 8 were matched against the customer name. However, there are three records which did not match.
There are two reasons for this:
- no customer record exists.
- the customer name is misspelled in the source.
Jonh Sward, Smith Warren and Matthew Harris are the customers which are not matched against the dimCustomer table. If you analyse the customer table, there is no record for Matthew Harris. There are, however, records in the dimCustomer table for John Sward and Smith Allen, which closely match the missing records.
So you can see that Matthew Harris falls into the first category, which is no record in the customer table, and Jonh, probably a misspelling of John, falls into the second category. However, it is not certain whether Smith Warren is a new customer or a misspelling of Smith Allen.
The next question is how to identify the misspelled customers. For this, a fuzzy lookup can be used. Fuzzy Lookup uses the Error-Tolerant Index (ETI) to find matching rows in the reference table. Each record in the reference table is broken up into words (also known as tokens), and the ETI keeps track of all the places in the reference table where a particular token occurs. If you take a name as an example, if your reference data contains John Sward, the ETI will contain entries for John and Sward. In addition, Fuzzy Lookup indexes substrings, known as q-grams, so that it can better match records that contain errors. The more unique tokens and the more rows in the reference table, the more entries and the longer the occurrence lists in the ETI. The ETI will be roughly as big as your reference table.
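Purely as an illustration of what a q-gram is (this is not how the ETI itself is built or stored), the three-character substrings of the token Sward can be produced like this:

-- Illustration only: the 3-character q-grams of the token 'Sward'.
DECLARE @token NVARCHAR(50) = N'Sward';
SELECT SUBSTRING(@token, n, 3) AS qgram
FROM (VALUES (1), (2), (3)) AS nums(n)
WHERE n <= LEN(@token) - 2;
-- Returns: Swa, war, ard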
You can find all the details of fuzzy lookup and how it works at http://msdn2.microsoft.com/en-us/library/ms345128.aspx.
The following screen will appear after selecting the properties of the Fuzzy Lookup. Here, you can specify the reference table in which you have “clean” data. You have the option of saving the index into a table so that it can be reused in future without re-building the index. In this way you can improve the performance of the fuzzy lookup.
From the columns tab you can define the columns you need to match.
The Advanced tab is an important tab when configuring your fuzzy lookup.
In the case of dirty data, you may get multiple matching records, so you can define the number of records that should be returned from the fuzzy lookup. Note that increasing the value of this property may increase the time it takes to process each input row.
Next is the similarity threshold. The value you choose depends entirely on your data.
Back to the SSIS package again. The three records will now be passed to the fuzzy lookup. To analyse the results, I have added a data viewer to the package just after the fuzzy lookup. Below is the grid view:
You can see that two additional columns have been introduced, namely _Similarity and _Confidence. As SSIS performs the fuzzy lookup, the similarity and confidence values indicate the quality of the data match. The similarity value indicates how closely the input data resembles its proposed match. A similarity value of one indicates an identical match; the closer the value is to one, the closer the match. To complement the similarity, the confidence describes the level of certainty that SSIS has about the match.
Selecting the correct similarity and confidence thresholds therefore depends entirely on your application and on how dirty your data is, so it is better to have an idea of that dirtiness before choosing values.
By using a Conditional Split transformation you can then perform whatever actions you need.
In the above conditional split we have defined that data with _Similarity > 0.65 and _Confidence > 0.9 is accepted. Data which does not fall into this category but has _Similarity > 0.5 and _Confidence > 0.8 is likely to be a match, although these matches may need manual intervention, so it is best to log them into a screening table. At the end of the process, you can send a notification or generate a report using the Send Mail Task.
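Expressed as a T-SQL sketch (the package itself does this with Conditional Split expressions, not a query), the three bands look roughly as follows; #FuzzyLookupOutput is an assumed name for the fuzzy lookup output rows.

-- Rough T-SQL view of the three bands produced by the conditional split.
-- #FuzzyLookupOutput is an assumed name for the fuzzy lookup output rows.
SELECT CustomerName, _Similarity, _Confidence,
       CASE
           WHEN _Similarity > 0.65 AND _Confidence > 0.9 THEN 'Accept the match'
           WHEN _Similarity > 0.5  AND _Confidence > 0.8 THEN 'Log for manual screening'
           ELSE 'Treat as a possible new customer'
       END AS MatchBand
FROM #FuzzyLookupOutput;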
Then what about the data which does not fall into either of these categories? Those may be new customers. Normal practice would be to log that data to a table and deal with it manually later. However, we will assume that in this case we are going to insert it as a new customer record.
You can insert the customer record and perform another lookup to identify the customerSK for the new customer.
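In T-SQL terms, that step is roughly the following; in the package it is an insert into dimCustomer followed by another Lookup transformation.

-- Sketch: insert the new customer, then resolve its surrogate key.
INSERT INTO dimCustomer (CustomerName)
VALUES (N'Matthew Harris');

SELECT CustomerSK
FROM dimCustomer
WHERE CustomerName = N'Matthew Harris';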
Even though this is a simple lookup like the one we performed earlier, there is an important point to remember. A Lookup has a property called CacheType, which is Full by default.
The type of caching can be:
- Full - the complete reference dataset is read before the input is processed.
- Partial - the transformation limits the size of the cache to a specified size.
- None - each lookup results in a new query to the database.
As you can see, when the Full option is set, the dataset is read before the input is processed. Because we used a customer lookup earlier, the next lookup would reuse that previously built cache, and since we have only just inserted the new customer record, it would not be retrieved by the subsequent lookup when the cache type is set to Full.
The Full cache type has the potential to give your package a performance boost, especially if the values you need to look up are often repeated in the input. However, in this case we have to set the cache type to None so that each lookup performs a new query against the database.
Fuzzy Grouping
When standardized reference data is not available, fuzzy grouping develops a master set of unique records from a data set containing a combination of unique and duplicate records. For example, you may receive your customer data from different data sources.
If we consider our previous example, when we are trying to insert new customer records, there can be duplicate records. So we have to get the unique records rather than inserting duplicates. To do this, you will need to drag a Fuzzy Grouping transformation in before the Insert New Customer Records transformation.
Configuring a Fuzzy Grouping transformation is a simple task.
The first tab, Connection Manager, is used to assign the connection. The next tab, Columns, is where you select the columns that you need to group.
When selecting the columns to group on, there are two parameters which are important - Numerals and Comparison Flags.
The Numerals option allows configuration of the significance of numbers in the input stream when grouping text logically. The options are to consider leading, trailing, both leading and trailing, or neither leading nor trailing numbers significant. This option needs to be considered when comparing addresses or similar types of information. Comparison Flags offer a few more options for grouping.
Next is the Advanced tab.
The Advanced tab contains the settings for controlling the fuzzy logic algorithms that assign groupings to text in the input stream. You can set the names of the three additional fields that will be added automatically to the output of this transformation. These fields are named “_key_out”, “_key_in” and “_score” by default. As the rows flow through the pipeline, Fuzzy Grouping assigns an ID to each of them, which is the “_key_in”. When Fuzzy Grouping groups a collection of rows, it determines which row should be the representative. It then sets the “_key_out” column of every row in the group to the “_key_in” value of the representative. As a result, if you want to write only the representative rows to your output, you can filter the Fuzzy Grouping output through a Conditional Split transformation that selects only rows where “_key_in” equals “_key_out”.
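A minimal T-SQL sketch of that filter is shown below; the package uses a Conditional Split for this, and #FuzzyGroupingOutput is an assumed name for the Fuzzy Grouping output rows.

-- Keep only the representative row of each fuzzy group.
-- #FuzzyGroupingOutput is an assumed name for the Fuzzy Grouping output rows.
SELECT CustomerName
FROM #FuzzyGroupingOutput
WHERE _key_in = _key_out;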
The token delimiters can also be set if, for example, you don't want the comparison process to break the incoming strings on . (dot) or spaces. The default for this setting is all common delimiters.
To visualize what is happening in the fuzzy grouping, I have added two data viewer grids, one before and one after the Fuzzy Grouping transformation.
We had three non-matching customers which are shown in the top data viewer grid.
The next data grid is the one generated after the Fuzzy Grouping. You can see that “_key_in” is unique, while the “_key_out” value for the customer names Matthew Harris and Matthew Harri is the same. This is exactly what we need in order to eliminate the duplicates.
The next step is to use a conditional split to identify the unique records.
As we only need to identify the unique records, the above condition will be sufficient. However, if you need to log duplicate records, you can insert them into a table using the Not equal condition.
With that we have completed data cleansing using SSIS; the following diagram shows the entire package. Please note that I have not included the DateSK lookup.
Conclusion
In the real world, data is dirty and cleaning it will consume a lot of resources. Using SQL Server Integration Services' new tools, fuzzy lookup and fuzzy grouping, you have a higher degree of control over the cleaning process.