10/03/2013

First Try #SSIS Fuzzy Lookup Transformation

Today I decided to learn more about Fuzzy Logic in Microsoft SSIS.

First thing was to make sure the Microsoft Visual Studio was installed and loaded.

Next step, download the AdventureWorks database, found here.  I selected to download the Zip files containing both the MDF and LDF files:


I found a good reference link which already has a sample Visual Studio project created and available for download.


Then opened SQL Server Data Tools (2012 version of BIDS), opened the project...


Pointed the Data Sources to the newly installed Adventure Works as well as create 2 new tables in SQL Server:
  1. Human Intervention
  2. Scrubbed Output
And pointed the 3rd Data Source to the file stored locally on the pc:

3.Parts Roster

All set to run the Project:



SELECT [PartName]
,[NumberOfPartsTaken]
,[_Similarity_Name]
,[ProductNumber]
,[_Similarity]
,[_Confidence]
,[_Similarity_PartName]
,[_Match]
FROM [dbo].[Human Intervention]

Returns:

SELECT [PartName]
,[NumberOfPartsTaken]
,[_Similarity_Name]
,[ProductNumber]
,[_Similarity]
,[_Confidence]
,[_Similarity_PartName]
FROM [dbo].[Scrubbed Output]


There were 105 Solid Matches:


There were 5 Non Matches:


1 Likely Match


Here's the Fuzzy Lookup Configuration:


Page 2:


And Page 3:


You actually have the option to modify the Index Table, In this case I chose the option to Store a New Index, gave it a name and clicked the option box to Maintain Stored Index.


What that does it to create the table in SQL Server, dbo.FuzzyLookMatchIndex


Selecting from the Table:

SELECT [Token]
,[ColumnNumber]
,[TokenProp]
,[BucketNumber]
,[Freq]
,[Rids]
FROM [dbo].[FuzzyLookupMatchIndex]

Produces:


It also creates an index table:

SELECT [Name]
,[ProductNumber]
,[RID_131003_14:44:39_10248_55413ee6-37a9-4e1e-b991-73998e085eba]
FROM [dbo].[FuzzyLookupMatchIndex_FLRef_131003_14:44:39_10248_55413ee6-37a9-4e1e-b991-73998e085eba]


Every time you run the package it needs to parse through the data and create the Error Tolerant Index (ETI) file, in our example there wasn't much data, however, it data sets of 10s of Millions of rows there potential for slowdown.

You can actually select the option to point to an existing Index table already created on SQL Server.

And by selecting the Maintain Stored Index option box, it creates a Trigger on the reference table that looks for and handles modifications, keeping everything in synch.

You could partition off the Fuzzy Matchup into multiple components, each reading different data, all pointing to the same ETI file, and then have the package Union All the results back into a single record set.

This is one way to reduce the runtime of large data sets.

On the Advanced tab, there is a slider which allows the developer to adjust the match number as well as set the Maximum number of matches to output per lookup.


I modified the slider from 0.00 to 50.00 and the results changed:



The Conditional Split looks like this:


The Union All looks like this:


And that's about it for Fuzziness today.  As you know, the subject is very complex and goes far down the rabbit hole, this blog post just documents my entry into the world of Fuzzy Transformation and how you can get started if you had an interest.

My thanks to Michael K. Campbell (is a professional SQL Server consultant with years of experience as a DBA and database developer. - See more at: http://www.sqlteam.com/article/using-fuzzy-lookup-transformations-in-sql-server-integration-services#sthash.dFwnT5Tg.dpuf)
for providing the original Blog post and Project Source Code.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.

Thoughts to Ponder