By design, the SQL programming language works with data sets, so it is not surprising that for most problems, row by row processing is not nearly as efficient as data set processing. There is a class of problems, however, that is much easier to solve using cursors. A typical "cursor friendly" problem is one where the data set returned contains at least one column whose value depends on column values from one or more previous rows of the same row set. Even when a data set based solution exists, it is hard to build a query that is more efficient than a cursor based solution. Usually, advanced coding techniques have to be used, and sometimes a special index design is required for the tables involved. For that type of problem, it is usually better to send a simple row set to the calling application and let the client cycle through the rows and calculate the column value derived.
The Problem
Let's define the problem as displaying cumulative sales for a specific store and product. This is a simplified example since I haven't included the sales period, which would be more realistic. I don't want to include a discussion about handling time periods as a part of the criteria in Transact-SQL. That topic deserves a separate article.
Let's say there is a sales table with the structure below:
CREATE TABLE [dbo].[Sales] (
[TransactionID] [int] IDENTITY (1, 1) NOT NULL,
[StoreID] [int] NOT NULL,
[productID] [int] NOT NULL,
[transactionTime] [datetime] NOT NULL,
[amount] [money] NOT NULL
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[Sales] WITH NOCHECK ADD
CONSTRAINT [PK_Sales] PRIMARY KEY CLUSTERED
(
[TransactionID]
) ON [PRIMARY]
GO
To populate the table you can use this script:
set noCount on
declare @i int
set @i = 0
while @i < 100000 begin
insert into dbo.Sales(StoreID, productId, transactionTime, amount)
select 1,
p.productID,
dateAdd(minute, - @i, getDate()),
case (p.ProductID)
when 1 then 10
when 2 then 100
end
from (select 1 as productID union all select 2 as productID) as p
if @i % 1000 = 0 print @i
set @i = @i+1
end
set nocount off
go
set noCount on
declare @i smallint
set @i = 0
while @i < 1000 begin
insert into dbo.Sales(StoreID, productId, transactionTime, amount)
select 2,
p.productID,
dateAdd(minute, - @i, getDate()),
case (p.ProductID)
when 1 then 10
when 2 then 100
end
from (select 1 as productID union all select 2 as productID) as p
set @i = @i+1
end
set nocount off
go
Finally, we need an index on StoreID and ProductID:
CREATE INDEX [IX_Sales] ON [dbo].[Sales]([StoreID], [productID], [transactionID]) ON [PRIMARY]
GO
We want the script below:
execute dbo.Sales_sel_by_StoreID_ProductID @storeID = 2, @productID = 1
To output this row set:
transactionID | transactionTime | amount | total |
---|---|---|---|
200001 | 29.12.2005 17:18 | 10.0000 | 10.0000 |
200003 | 29.12.2005 17:19 | 10.0000 | 20.0000 |
200005 | 29.12.2005 17:20 | 10.0000 | 30.0000 |
200007 | 29.12.2005 17:21 | 10.0000 | 40.0000 |
200009 | 29.12.2005 17:22 | 10.0000 | 50.0000 |
200011 | 29.12.2005 17:23 | 10.0000 | 60.0000 |
200013 | 29.12.2005 17:24 | 10.0000 | 70.0000 |
200015 | 29.12.2005 17:25 | 10.0000 | 80.0000 |
200017 | 29.12.2005 17:26 | 10.0000 | 90.0000 |
200019 | 29.12.2005 17:27 | 10.0000 | 100.0000 |
… | … | … | … |
Cursor Solution
The simplest and at the same time almost the most efficient solution involves the use of a cursor:
create procedure dbo.Sales_sel_by_StoreID_ProductID
@StoreID int,
@ProductID int
as begin
set noCount on
declare @report table(
transactionID int primary key clustered,
transactionTime dateTime not null,
amount money not null,
total money not null
)
declare runningTotalsCursor cursor for
select transactionID, TransactionTime, Amount
from dbo.Sales
where
StoreID = @StoreID and
ProductID = @ProductID and
order by
transactionID
declare @transactionID int
declare @transactionTime dateTime
declare @amount money
declare @total money
set @total = 0
open RunningTotalsCursor
while (0=0) begin
fetch next from RunningTotalsCursor into @transactionID, transactionTime, @amount
if @@fetch_status <> 0 break
set @total = @total + @amount
insert into @report(transactionID, transactionTime, amount, total) values(@transactionID, @transactionTime, @amount, @total)
end -- while
close runningTotalsCursor
deallocate runningTotalsCursor
select transactionID, transactionTime, amount, total from @report order by transactionID
set noCount off
end
If you use SQL Profiler to test the efficiency of the solution, pay attention to the duration, reads, and CPU counter.
Correlated Query Solution
The simplest and the least efficient solution using a data set approach is one that takes advantage of a correlated query. This is almost the definition of a running total: Return all transaction IDs, transaction times, and amounts for a specific product, and store alongside it the sum of all amounts from the first transaction to the current one:
create procedure dbo.Sales_sel_by_StoreID_ProductID
@StoreID int,
@ProductID int
as begin
set noCount on
select top 10
a.transactionID,
a.transactionTime,
a.amount,
(select sum(amount)
from
dbo.Sales b
where
b.transactionID <= a.TransactionID and
b.StoreID = @StoreID and
b.ProductID = @ProductID
) as total
from
dbo.Sales a
where
a.StoreID = @StoreID and
a.ProductID = @ProductID
order by transactionID
set noCount off
end
This query will have N*(N+1)/2 row-reads from the sales table where N is the number of rows returned by the procedure. The consequence is terrible performance compared to the cursor solution. There is a similar solution using join and group by, that also needs N*N magnitude of reads.
Table Variable Solution
The next solution is the most efficient pure Transact-SQL one I have figured out so far. To reach that solution let's look at Books Online (BOL), update statement topic, local variables in set clause:
@variable
Is a declared variable that is set to the value returned by expression.
SET @variable = column = expression sets the variable to the same value as the column. This differs from SET @variable = column, column = expression, which sets the variable to the pre-update value of the column.
That means we can use a stored procedure local variable to save the cumulative total and update the total column of the table variable we created to return the result set. Here is the code:
create procedure dbo.Sales_sel_by_StoreID_ProductID
@StoreID int,
@ProductID int
as begin
set noCount on
declare @report table(
transactionID int primary key clustered,
transactionTime dateTime not null,
amount money not null,
total money not null
)
declare @total money
set @total = 0
insert into @report(transactionID, transactionTime, amount, total)
select a.transactionID,
a.transactionTime,
a.amount,
0.0 as total
from
dbo.Sales a
where
a.StoreID = @StoreID and
a.ProductID = @ProductID
update r
set @total = r.total = @total + amount
from @report r
select r.transactionID, r.transactionTime, r.amount, r.total
from @report r
order by r.transactionID
set noCount off
end
This is the most efficient data set based solution! However, there is a small problem with this approach. How do we know that the update will be done exactly in the order of increasing transactionID? Is it enough to have a clustered primary key on that column to be safe? I think so, but I can't prove it. A future service pack could change the way the execution plan for this procedure is built. Even though I don't expect the way critical update executes to change any time soon, to be absolutely safe we have to force the update order more explicitly. One way to do that would be to use a derived table with an order by clause. An order by clause is not allowed in derived tables unless the "top" option is used too. So one possible "safe" solution is below:
create procedure dbo.Sales_sel_by_StoreID_ProductID
@StoreID int,
@ProductID int
as begin
set noCount on
declare @report table(
transactionID int primary key clustered,
amount money not null,
total money not null
)
declare @total money
set @total = 0
insert into @report(transactionID, amount, total)
select a.transactionID,
a.amount,
0.0 s total
from
dbo.Sales a (index=ix_sales)
where
a.StoreID = @StoreID and
a.ProductID = @ProductID
update r
set @total = r.total = @total + amount
from (select top 100 percent
transactionID
from @report
order by transactionID) as s
inner loop join @report r on r.transactionID = s.transactionID
option (force order)
select s.transactionID, s.transactionTime, s.amount, r.total
from @report r
join dbo.sales s on s.transactionID = r.transactionID
order by r.transactionID
set noCount off
end
However, if you try to change the order to descending you may be surprised by the result. The problem is that "select top N … order by" guarantees that the first N rows based on order will be returned, but the order they are returned is not guaranteed.
The only reliable way to force the update order is to force an index on the sales table:
create procedure dbo.Sales_sel_by_StoreID_ProductID
@StoreID int,
@ProductID int
as begin
set noCount on
declare @report table(
transactionID int primary key clustered,
total money not null
)
declare @total money
set @total = 0
insert into @report(transactionID, total)
select a.transactionID,
0.0 as total
from
dbo.Sales a (index=ix_sales)
where
a.StoreID = @StoreID and
a.ProductID = @ProductID
update r
set @total = r.total = @total + amount
from
dbo.Sales s (index=ix_sales)
join @report r on r.transactionID = s.transactionID
where
s.StoreID = @StoreID and
s.ProductID = @ProductID
option (force order)
select s.transactionID, s.transactionTime, s.amount, r.total
from @report r
join dbo.sales s on s.transactionID = r.transactionID
order by r.transactionID
set noCount off
end
It was not enough to have the sales table rows ordered using the index and joined with the table variable. We had to make sure that the inner table used is a derived one. For that reason, we added option (force order). The force order hint doesn't force the data order; instead, it actually forces the join order.
Quote from BOL:
FORCE ORDER
Specifies that the join order indicated by the query syntax is preserved during query optimization.
On top of that, we didn't want any fancy join algorithm to change the order in which rows are updated, so we used a loop join hint.
This solution is slower than the first table variable based solution but it is still more efficient than a solution using a cursor. The solution will work as long as Microsoft supports the query hints used.
Conclusion
None of the solutions mentioned is the one I prefer. The solution I recommend is to return a "raw" data set to the client and let the client loop through rows and calculate the running total. The query returning "raw" data is the fastest and the least resource intensive. Implementation of running total calculation is fast and straight forward for programming languages used on the client side.
No comments:
Post a Comment