Two frequently asked related questions in Microsoft SQL Server discussion groups are:
- How to pass an array to a stored procedure?
- How to select data based on a list of values?
Examples:
- User wants to search for jobs based on a list of desired locations.
- User marks e-mails in an inbox for deletion.
SQL's set-based processing doesn't handle such arbitrary-list criteria well. Sure, there is the IN operator, but how do you pass such a list to a stored procedure, and how do you achieve a good execution plan without a recompilation each time the query is executed?
The Problem
Let us prepare a concrete scenario so that we can test and compare different solutions. We'll define a table structure and a stored procedure for handling orders that may pass through different statuses during their life cycles. For example, an order may be submitted, reviewed, approved, waiting for approval, in process, rejected, closed, and so on.
Let's define the orderStatuses table first:
create table dbo.OrderStatuses(
orderStatusID tinyInt primary key,
orderStatusName varchar(30),
orderStatusDescription varchar(255)
)
For testing purposes, we can use a simplified orders table that references the user and orderStatuses tables and carries additional info in a varchar(200) column:
create table orders(
orderID int identity(1,1) primary key clustered,
userID int not null,
orderStatusID tinyInt not null,
orderInfo varchar(200)
)
In this article we will review a few implementations of the stored procedure that returns a count of orders for a given user and a list of statuses. The stored procedure structure is shown below:
create procedure orders_sel_by_userID_orderStatusList
@userId int,
@orderStatusList varchar(100)
as begin
set noCount on
end
To populate the table we can use the script below:
set noCount on
declare @StatusCount int set @StatusCount = 7
declare @UserCount int set @UserCount = 50
declare @Count int set @count = @StatusCount * @UserCount * 10000
declare @i int set @i = 1
while @i <= @count begin
insert into orders(userID, orderStatusId, orderInfo) values(@i % @UserCount + 1, @i % @StatusCount + 1, 'IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII')
set @i = @i + 1
if @i % 10000 = 0 print @i
end
set noCount off
Finally, we need an index on userID and orderStatusID:
create index orders_UserID_orderStatusID on orders(userID, orderStatusID)
To test the performance of each implementation, I used a script that executes the stored procedure when just one statusID is passed, when two and three statuses are passed, and when different parameters are passed in subsequent executions.
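A minimal sketch of such a test script might look like the following; the user IDs and status lists are illustrative, not the exact values used in the original tests:

```sql
-- Hypothetical test harness; statistics output lets us compare reads and CPU time.
set statistics io on
set statistics time on

-- just one statusID
exec orders_sel_by_userID_orderStatusList @userId = 1, @orderStatusList = '1'
-- two and three statuses
exec orders_sel_by_userID_orderStatusList @userId = 1, @orderStatusList = '1,4'
exec orders_sel_by_userID_orderStatusList @userId = 1, @orderStatusList = '1,4,7'
-- different parameters in subsequent executions
exec orders_sel_by_userID_orderStatusList @userId = 2, @orderStatusList = '2,5'
exec orders_sel_by_userID_orderStatusList @userId = 3, @orderStatusList = '3,6,7'

set statistics io off
set statistics time off
```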
Dynamic SQL Solution
The simplest, and at the same time one of the most efficient, solutions involves the use of dynamic SQL:
create procedure orders_sel_by_userID_orderStatusList
@userId int,
@orderStatusList varchar(100)
as begin
set noCount on
declare @query nvarchar(1000)
declare @parameters nvarchar(100)
set @query = N'Select count(*) from orders where userID = @userID and OrderStatusID in (' + @orderStatusList + ')'
set @parameters = N'@userID int'
exec sp_executeSQL @query, @parameters, @userId = @userID
set noCount off
end
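Calling the procedure is straightforward; for example, to count orders with statuses 1, 4, and 7 for user 1 (illustrative values):

```sql
exec orders_sel_by_userID_orderStatusList @userId = 1, @orderStatusList = '1,4,7'
```

Note that sp_executeSQL parameterizes @userID, so only the status list varies in the generated query text.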
Many Microsoft SQL Server specialists discourage the use of dynamic SQL because of the possibility of SQL injection attacks. In this case, however, the list of values would not be typed into a text box; it would instead be generated by some kind of multiple-choice control.
The advantage of this solution is that each query contains concrete values so Query Optimizer has maximum information for the cost calculation.
One drawback of the dynamic SQL solution is that for each distinct list, a separate execution plan will be compiled. That may cause high CPU usage due to high stored procedure cache misses and inserts if the query pattern involves frequent executions with different lists of statuses.
Another problem is that read rights to tables involved are necessary for successfully running such a stored procedure.
The optimal execution plan should use an index seek for each combination of userID and OrderStatusID values. That way we would have the fewest reads. For example, if "1,4,7" is passed as the @orderStatusList value, the following three index seeks would be used:
- index seek on userID = @userID and orderStatusID = 1
- index seek on userID = @userID and orderStatusID = 4
- index seek on userID = @userID and orderStatusID = 7
This means that only index entries that satisfy the condition would be accessed.
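Conceptually, such a plan treats the query as if it were written as a union of point queries, one per status in the list. This is only a sketch of the logical shape of the plan, not code the procedure generates:

```sql
-- Equivalent unrolled form for @orderStatusList = '1,4,7' (sketch).
-- Each branch is satisfiable by a single seek on orders_UserID_orderStatusID.
select count(*)
from (
    select orderID from orders where userID = @userID and orderStatusID = 1
    union all
    select orderID from orders where userID = @userID and orderStatusID = 4
    union all
    select orderID from orders where userID = @userID and orderStatusID = 7
) as t
```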
When testing the dynamic SQL implementation on different servers, I did not get consistent execution plans. In most cases index scans were used. Comparing the number of reads with the other solutions, I concluded that most if not all index rows for the given userID value were read, and the orderStatusID values outside of the list were then filtered out. In the example above, that means entries with orderStatusID values 2, 3, 5, and 6 are accessed too.
In order to avoid both the caching issue and a range scan based on the userID value alone, we need a table containing the values from the list that we can join with the main (orders) table. The next two solutions use that approach.
Solution Based on Table Function
The most popular solution to this kind of problem is joining the orders table with a table function that parses the list of values. The procedure code is straightforward:
create procedure orders_sel_by_userID_orderStatusList
@userId int,
@orderStatusList varchar(100)
as begin
Select count(*)
from orders o
join dbo.listOfValues(@orderStatusList) l on o.orderStatusID = l.value
where o.userID = @userID
end
One possible implementation of the listOfValues table function is below:
CREATE FUNCTION ListOfValues (@List varchar(100))
RETURNS @ListOfValues TABLE (value int)
AS
BEGIN
    DECLARE @off int
    DECLARE @value varchar(4000)
    DECLARE @prevOff int
    SELECT @prevOff = 1
    SELECT @off = CHARINDEX(',', @List, 0)
    WHILE (@off > 0)
    BEGIN
        SELECT @value = RTRIM(SUBSTRING(@List, @prevOff, @off - @prevOff))
        SELECT @prevOff = @off + 1
        INSERT INTO @ListOfValues (value) VALUES (@value)
        SELECT @off = CHARINDEX(',', @List, @prevOff)
    END
    IF (LEN(@List) - @prevOff + 1 > 0)
    BEGIN
        SET @value = RTRIM(SUBSTRING(@List, @prevOff, LEN(@List) - @prevOff + 1))
        INSERT INTO @ListOfValues (value) VALUES (@value)
    END
    RETURN
END
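The function can be tested on its own before being used in the join:

```sql
select value from dbo.ListOfValues('1,4,7')
-- returns three rows: 1, 4, 7
```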
This solution had the worst performance of all those I tested: it was the most CPU intensive and the slowest. It appears that the overhead of parsing the list of values outweighs the gain of having a single execution plan in the cache. A few other implementations of the function brought no significant performance improvement.
Solution Based on a Table Containing Lists of Values and Its Members
If we could avoid parsing the list of values and still have a table with all of the requested statuses, we would probably have the solution we are looking for. A table pre-populated with the lists and their members makes the parsing step unnecessary.
create table dbo.orderStatusIDListMembers(
list varchar(50),
OrderStatusID int not null,
primary key clustered (list, orderStatusID)
)
Sample data contained in the table would be:
List  | OrderStatusID
------|--------------
'1'   | 1
'1,2' | 1
'1,2' | 2
…     | …
To implement this solution we must keep the orderStatusIDListMembers table in sync with the orderStatuses table. To achieve that, we need procedures for inserting and deleting statuses.
create procedure dbo.OrderStatuses_ins
@orderStatusID tinyInt,
@orderStatusName varchar(30),
@orderStatusDescription varchar(255)
as begin
set noCount on
begin transaction OrderStatuses_ins
insert into dbo.OrderStatuses values(@orderStatusID, @orderStatusName, @orderStatusDescription)
if @@error <> 0 begin
rollback transaction OrderStatuses_ins
return (-1)
end
declare oldOrderStatusIDList insensitive cursor for
select distinct list
from dbo.orderStatusIDListMembers
declare @currentList varchar(50),
@newList varchar(50)
open oldOrderStatusIDList
set @newList = null
while (0 = 0) begin
fetch next from oldOrderStatusIDList into @currentList
if @@fetch_status <> 0 break
select @newList = isNull(@newList + ',', '') + cast(orderStatusID as varchar(50))
from (
select orderStatusID
from dbo.orderStatusIDListMembers
where list = @currentList
union
select @orderStatusID as orderStatusID
) as t
order by orderStatusID
insert into dbo.orderStatusIDListMembers(list, orderStatusID)
select @newList, orderStatusID
from (
select orderStatusID
from dbo.orderStatusIDListMembers
where list = @currentList
union
select @orderStatusID as orderStatusID
) as t
if @@error <> 0 begin
rollback transaction OrderStatuses_ins
close oldOrderStatusIDList
deallocate oldOrderStatusIDList
return(-1)
end
set @newList = null
end
close oldOrderStatusIDList
deallocate oldOrderStatusIDList
insert into dbo.orderStatusIDListMembers(list, orderStatusID) values(cast(@orderStatusID as varchar(50)), @orderStatusID)
if @@error <> 0 begin
rollback transaction OrderStatuses_ins
return(-1)
end
commit transaction OrderStatuses_ins
return 0
end
I'll omit the code of the procedure that deletes a single status because it is not needed for performance testing and its implementation is straightforward.
To test the solution, let's populate the orderStatuses and orderStatusIDListMembers tables.
exec dbo.OrderStatuses_ins @orderStatusID = 0, @orderStatusName = '0', @orderStatusDescription = '0'
exec dbo.OrderStatuses_ins @orderStatusID = 1, @orderStatusName = '1', @orderStatusDescription = '1'
exec dbo.OrderStatuses_ins @orderStatusID = 2, @orderStatusName = '2', @orderStatusDescription = '2'
exec dbo.OrderStatuses_ins @orderStatusID = 3, @orderStatusName = '3', @orderStatusDescription = '3'
exec dbo.OrderStatuses_ins @orderStatusID = 4, @orderStatusName = '4', @orderStatusDescription = '4'
exec dbo.OrderStatuses_ins @orderStatusID = 5, @orderStatusName = '5', @orderStatusDescription = '5'
exec dbo.OrderStatuses_ins @orderStatusID = 6, @orderStatusName = '6', @orderStatusDescription = '6'
exec dbo.OrderStatuses_ins @orderStatusID = 7, @orderStatusName = '7', @orderStatusDescription = '7'
The stored procedure implementation is similar to the previous one:
create procedure orders_sel_by_userID_orderStatusList
@userId int,
@orderStatusList varchar(100)
as begin
set noCount on
Select count(*)
from orders o
join dbo.orderStatusIDListMembers l on o.orderStatusID = l.OrderStatusID and l.list = @orderStatusList
where o.userID = @userID
end
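Because the join condition matches l.list against @orderStatusList as an exact string, the caller must pass the list formatted exactly as it is stored (for this sample data, comma-separated with no spaces and in ascending order):

```sql
-- Illustrative call; '1,2' must match the stored list string exactly.
exec orders_sel_by_userID_orderStatusList @userId = 1, @orderStatusList = '1,2'
```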
My tests showed that such an implementation is faster in cases where the list of values varies a lot.
The main drawback of this solution is that the number of rows in the orderStatusIDListMembers table grows exponentially with the number of statuses: with n statuses there are 2^n - 1 possible lists, so 7 statuses already yield 127 lists with 448 member rows, and 20 statuses would yield millions of rows. With too many statuses, maintaining the table becomes a problem. This solution is clearly not applicable to the "selected mail deletion" problem.
Conclusion
The implementation using a table function clearly doesn't perform as well as the other two solutions for the test cases in the script. The implementation based on a pre-populated table of lists and their members was the fastest. However, it may be overkill when there is little variety in the lists or the query runs infrequently. As usual, it is up to you to test each candidate solution and apply the one that best suits your query pattern.