10/03/2013

Troubleshooting Fuzzy Grouping Performance Issues Using SQL Profiler

In attempts to understand Fuzzy Grouping better, I created a quick package.

Then executed successfully, as rows appeared in the SQL-Server database.


However, I want to get under the hood, so I ran a SQL Profiler:

SELECT COUNT(*) FROM .[tempdb]..sysobjects WHERE name='##FuzzyLookupMatchIndex_131003_17:23:53_11216_8bad9401-1ec8-4c97-bc1a-e0b35b58f234'

CREATE TABLE [tempdb]..[##FuzzyLookupMatchIndex_131003_17:23:53_11216_8bad9401-1ec8-4c97-bc1a-e0b35b58f234]([Token] nvarchar(102), [ColumnNumber] int, [TokenProp] int, [BucketNumber] timestamp, [Freq] int, [Rids] nvarchar(4000))

ALTER TABLE [tempdb]..[##FuzzyLookupMatchIndex_131003_17:23:53_11216_8bad9401-1ec8-4c97-bc1a-e0b35b58f234] ALTER COLUMN [Token] nvarchar(102) COLLATE Latin1_General_CS_AS_KS_WS

exec [sys].sp_describe_first_result_set N'select * from [tempdb]..[##FuzzyLookupMatchIndex_131003_17:23:53_11216_8bad9401-1ec8-4c97-bc1a-e0b35b58f234]'

SELECT [Column0], [Column1], [_key_in] FROM [tempdb]..[##FLInRef_131003_17:23:53_14044_8ba8d4c1-ce9f-41ef-96ea-e55bce5ce8e4] ORDER BY [_key_in]

CREATE CLUSTERED INDEX token_idx ON [tempdb]..[##FuzzyLookupMatchIndex_131003_17:23:53_11216_8bad9401-1ec8-4c97-bc1a-e0b35b58f234]([Token], [ColumnNumber], [TokenProp], [BucketNumber] ASC)

exec [sys].sp_describe_first_result_set N'select * from [tempdb]..[##FuzzyLookupMatchIndex_131003_17:23:53_11216_8bad9401-1ec8-4c97-bc1a-e0b35b58f234]'

insert bulk [tempdb]..[##FuzzyLookupMatchIndex_131003_17:23:53_11216_8bad9401-1ec8-4c97-bc1a-e0b35b58f234]([Token] nvarchar(102) collate Latin1_General_CS_AS_KS_WS,[ColumnNumber] int,[TokenProp] int,[Freq] int,[Rids] nvarchar(4000) collate SQL_Latin1_General_CP1_CI_AS)with(KEEP_NULLS)

exec [sys].sp_describe_undeclared_parameters N'SELECT [Rids] FROM [tempdb]..[##FuzzyLookupMatchIndex_131003_17:23:53_11216_8bad9401-1ec8-4c97-bc1a-e0b35b58f234] WHERE [Token] = @P1 AND [ColumnNumber] = @P2 AND [TokenProp] = @P3 ORDER BY [BucketNumber]'

declare @p1 int
set @p1=1
exec sp_prepare @p1 output,N'@P1 nvarchar(102),@P2 int,@P3 int',N'SELECT [Rids] FROM [tempdb]..[##FuzzyLookupMatchIndex_131003_17:23:53_11216_8bad9401-1ec8-4c97-bc1a-e0b35b58f234] WHERE [Token] = @P1 AND [ColumnNumber] = @P2 AND [TokenProp] = @P3 ORDER BY [BucketNumber]',1
select @p1

exec [sys].sp_describe_first_result_set N'SELECT [Rids] FROM [tempdb]..[##FuzzyLookupMatchIndex_131003_17:23:53_11216_8bad9401-1ec8-4c97-bc1a-e0b35b58f234] WHERE [Token] = @P1 AND [ColumnNumber] = @P2 AND [TokenProp] = @P3 ORDER BY [BucketNumber]',N'@P1 nvarchar(102),@P2 int,@P3 int',1

declare @p1 int
set @p1=1
exec sp_prepare @p1 output,NULL,N'SELECT TOP 0 [Column0], [Column1], [_key_in], [_key_in] FROM [tempdb]..[##FLInRef_131003_17:23:53_14044_8ba8d4c1-ce9f-41ef-96ea-e55bce5ce8e4]',1
select @p1

exec [sys].sp_describe_first_result_set N'SELECT TOP 0 [Column0], [Column1], [_key_in], [_key_in] FROM [tempdb]..[##FLInRef_131003_17:23:53_14044_8ba8d4c1-ce9f-41ef-96ea-e55bce5ce8e4]',NULL,1

SELECT TOP 4294967295 [Column0], [Column1], [_key_in], [_key_in] as [RID_131003_17:23:53_11216_ed0111a6-9ed8-4856-b85a-682457f59c49] FROM [tempdb]..[##FLInRef_131003_17:23:53_14044_8ba8d4c1-ce9f-41ef-96ea-e55bce5ce8e4] ORDER BY [RID_131003_17:23:53_11216_ed0111a6-9ed8-4856-b85a-682457f59c49]

exec [sys].sp_describe_undeclared_parameters N'SELECT [Column0], [Column1], [_key_in], [_key_in] FROM [tempdb]..[##FLInRef_131003_17:23:53_14044_8ba8d4c1-ce9f-41ef-96ea-e55bce5ce8e4] WHERE [_key_in]=@P1'

declare @p1 int
set @p1=1
exec sp_prepare @p1 output,N'@P1 int',N'SELECT [Column0], [Column1], [_key_in], [_key_in] FROM [tempdb]..[##FLInRef_131003_17:23:53_14044_8ba8d4c1-ce9f-41ef-96ea-e55bce5ce8e4] WHERE [_key_in]=@P1',1
select @p1

exec [sys].sp_describe_first_result_set N'SELECT [Column0], [Column1], [_key_in], [_key_in] FROM [tempdb]..[##FLInRef_131003_17:23:53_14044_8ba8d4c1-ce9f-41ef-96ea-e55bce5ce8e4] WHERE [_key_in]=@P1',N'@P1 int',1

insert bulk [tempdb]..[##FLOut_131003_17:23:53_14044_8ba8d4c1-ce9f-41ef-96ea-e55bce5ce8e4]([Column0_in] nvarchar(102) collate SQL_Latin1_General_CP1_CI_AS,[Column0_ref] nvarchar(102) collate SQL_Latin1_General_CP1_CI_AS,[Column0_sim] float,[Column1_in] smallint,[Column1_ref] smallint,[_key_in] int,[_key_out] int,[_score] float)

CREATE CLUSTERED INDEX GDBMOutIndex ON [tempdb]..[##FLOut_131003_17:23:53_14044_8ba8d4c1-ce9f-41ef-96ea-e55bce5ce8e4](_key_out)

declare FGCursor cursor LOCAL for select _key_out from [tempdb]..[##FLOut_131003_17:23:53_14044_8ba8d4c1-ce9f-41ef-96ea-e55bce5ce8e4] group by _key_out order by sum(_score) desc declare @key_out int open FGCursor fetch next from FGCursor into @key_out while @@fetch_status=0 begin insert into [tempdb]..[##FGOut_131003_17:23:53_14044_8ba8d4c1-ce9f-41ef-96ea-e55bce5ce8e4] select * from [tempdb]..[##FLOut_131003_17:23:53_14044_8ba8d4c1-ce9f-41ef-96ea-e55bce5ce8e4] o where _key_out = @key_out and not exists (select 1 from [tempdb]..[##FGOut_131003_17:23:53_14044_8ba8d4c1-ce9f-41ef-96ea-e55bce5ce8e4] i where i._key_in = o._key_in or i._key_in = o._key_out) fetch next from FGCursor into @key_out end deallocate FGCursor

SELECT Column0_in, Column0_ref, Column0_sim, Column1_in, Column1_ref, _key_in, _key_out, _score FROM [tempdb]..[##FGOut_131003_17:23:53_14044_8ba8d4c1-ce9f-41ef-96ea-e55bce5ce8e4] order by _key_out, _score desc

insert bulk [FuzzyGroupingTest1]([_key_in] int,[_key_out] int,[_score] real,[PartName] nvarchar(50) collate SQL_Latin1_General_CP1_CI_AS,[NumberOfPartsTaken] smallint,[PartName_clean] nvarchar(50) collate SQL_Latin1_General_CP1_CI_AS,[NumberOfPartsTaken_clean] smallint,[_Similarity_PartName] real)with(TABLOCK,CHECK_CONSTRAINTS)

drop table [tempdb]..[##FLInRef_131003_17:23:53_14044_8ba8d4c1-ce9f-41ef-96ea-e55bce5ce8e4]

 What's going on here?

Well, it looks to be leveraging SQL Server Global Temp tables, adding Clustered Indexes, Loopoing through Cursors with Group By and Order By and finally Inserting Bulk results into the SQL Server table designated in the SSIS package with a final drop temp table statement.

This all occurred in basically a few seconds.

The issue is, what happens when processing 10s of Millions of records, what is causing the bottlenecks?  As most of us know that Cursors are slow and using Temp tables chews up the resources on TempDB.

So that's the reason for running the SQL Profiler.

I've read that the heavy lifting of these temp tables behind the scenes should not be run on a production box.

However, I still need to get to the root cause of the slowness.

Is it memory related?  CPU Speed?  Number of CPUs?  Network Speed?  Disk IO Speed?  Is something else running on the PC?

Hopefully, with more analysis, the answer will be revealed.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.

Thoughts to Ponder