Saturday, September 27, 2008

Reduce Aggravating Aggregation: Improve the Performance of History or Status Tables

A third-party system that I manage has “history” or “status” tables following a common pattern: in the database, there is some entity (in my case, apartment units) that is labeled with a value indicating the entity’s status, and the value changes over time. Each of these entities has exactly one value representing the current status, but each also has a series of past values and dates or timestamps, tracing the history of the entity status and when it changed. For apartment units, for example, a given unit might be vacant, then reserved for a new resident, then occupied, then occupied but with the resident having given notice, then vacant again, and so on. It’s vital to track the dates when the status value changes, so the vendor uses a history table to store the unit status, and looks up the maximum date when interrogating the database for the current status.

I was facing a performance problem rooted in that design, which got me thinking about the general case. What follows illustrates the issue using a generic orders/order-status example, rather than the apartment units in my real case, in an effort to make it more broadly applicable.

Suppose we are faced with the canonical set of “order” records, and as the orders are processed in our company, each one has a status value that advances through a sequence, let’s say from “Fulfillment” to “Stocking,” “Packaging,” “Shipping,” and so on. At any given time each order has only one status, but we need to keep track of when each order had each status and allow the database to record the dates for each.

A highly simplified version might be implemented with tables orders, customers, statuses and orderStatus, like this:

CREATE TABLE orders (
orderID int NOT NULL IDENTITY(1, 1),
customerID int NOT NULL,
orderDate datetime NOT NULL,
description nvarchar(255) NOT NULL
)
GO

ALTER TABLE orders
ADD CONSTRAINT PK_orders PRIMARY KEY (orderID)
GO

CREATE INDEX I_orders_customerid ON orders (customerID)
GO

CREATE TABLE customers (
customerID int NOT NULL IDENTITY(1, 1),
firstName nvarchar(100) NOT NULL,
lastName nvarchar(100) NOT NULL
)
GO

ALTER TABLE customers
ADD CONSTRAINT PK_customers PRIMARY KEY (customerID)
GO

ALTER TABLE orders
ADD CONSTRAINT FK_orders_customers
FOREIGN KEY (customerID) REFERENCES customers (customerID)
GO

CREATE TABLE statuses (
status nvarchar(50) NOT NULL
)
GO

ALTER TABLE statuses
ADD CONSTRAINT PK_statuses PRIMARY KEY (status)
GO

CREATE TABLE orderStatus (
orderID int NOT NULL,
statusDate datetime NOT NULL,
status nvarchar(50) NOT NULL
)
GO

ALTER TABLE orderStatus
ADD CONSTRAINT PK_orderStatus PRIMARY KEY (orderID, statusDate)
GO

ALTER TABLE orderStatus
ADD CONSTRAINT FK_orderStatus_orders
FOREIGN KEY (orderID) REFERENCES orders (orderID)
GO

ALTER TABLE orderStatus
ADD CONSTRAINT FK_orderStatus_statuses
FOREIGN KEY (status) REFERENCES statuses (status)
GO

As a test case, I’ve populated the tables with sample data as follows. In another database, I have two tables of random first and last names, created from files the US Census Bureau publishes on the web. Using those tables, I created 100,000 customer records in the test tables with random combinations of first and last names:

INSERT INTO customers ( firstName, lastName )
SELECT TOP 100000 first.name, last.name
FROM ( SELECT TOP 1000 [name] FROM sampdatafnames ORDER BY NEWID() ) [first]
CROSS JOIN ( SELECT TOP 1000 [name] FROM sampdatalnames ORDER BY NEWID() ) [last]
ORDER BY NEWID()

I also have a small table of “sample item names,” with which I generated 1,000,000 sample orders, with dates spanning the last year, associated with random customers:

DECLARE @i INT
SET @i = 0

BEGIN TRAN
WHILE @i < 1000000 BEGIN
INSERT INTO orders ( customerID, orderDate, [description] )
SELECT
-- Random customer (1-100000; without the +1 this could yield a nonexistent customerID of 0):
CAST( 100000 * RAND( CAST( CAST( NEWID() AS BINARY(8) ) AS INT)) AS INT ) + 1 customerid,
-- Random recent date:
DATEADD( day, -365 * RAND( CAST( CAST( NEWID() AS BINARY(8) ) AS INT)), GETDATE() ) orderDate,
-- Random order item:
( SELECT TOP 1 name FROM sampdataitems ORDER BY NEWID() ) [description]
SET @i = @i + 1
END
COMMIT

When using this design, each time the status of an order is changed, the order’s new status is inserted, with a date stamp, into the orderStatus table. So, for example, if each order begins as “Fulfillment,” then a row would be inserted for every order, at its creation, with the status “Fulfillment.” When an order advances to the next status value, a second row is inserted into orderStatus with the new status value and the later date stamp. This process repeats, with the orderStatus record having the maximum date stamp representing the current status of the order.
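In day-to-day operation, then, a status change is nothing more than an insert. As a minimal sketch (order 42 and the GETDATE() timestamp are just illustrative values, not from the real system):

```sql
/* A single status change under the original design: the application
   simply inserts a new row, and "current" is implied by MAX(statusDate). */
INSERT INTO orderStatus ( orderID, statusDate, [status] )
VALUES ( 42, GETDATE(), 'Stocking' )
```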

We can simulate this situation in bulk by populating the orderStatus table with inserts, each having a date offset from the order date, thus:

/* Create initial status records for all the orders */

INSERT INTO orderStatus ( orderid, statusdate, [status] )
SELECT orderid, orderdate, 'Fulfillment' FROM orders

/* For a subset of the orders, add the next status, a few days later */
INSERT INTO orderStatus ( orderid, statusdate, [status] )
SELECT TOP 800000 orderid, DATEADD( day, 5, orderdate ), 'Stocking'
FROM orders ORDER BY NEWID()

/* For a subset of those last orders, add the next status, a few days later */
INSERT INTO orderStatus ( orderid, statusdate, [status] )
SELECT TOP 600000 orders.orderid, DATEADD( day, 5, orderStatus.statusdate ), 'Packaging'
FROM orders INNER JOIN orderStatus ON orders.orderid = orderStatus.orderID
AND orderStatus.status = 'Stocking'
ORDER BY NEWID()

/* Repeat… */
INSERT INTO orderStatus ( orderid, statusdate, [status] )
SELECT TOP 400000 orders.orderid, DATEADD( day, 2, orderStatus.statusdate ), 'Shipping'
FROM orders INNER JOIN orderStatus ON orders.orderid = orderStatus.orderID
AND orderStatus.status = 'Packaging'
ORDER BY NEWID()

INSERT INTO orderStatus ( orderid, statusdate, [status] )
SELECT TOP 200000 orders.orderid, DATEADD( day, 2, orderStatus.statusdate ), 'Shipped'
FROM orders INNER JOIN orderStatus ON orders.orderid = orderStatus.orderID
AND orderStatus.status = 'Shipping'
ORDER BY NEWID()

INSERT INTO orderStatus ( orderid, statusdate, [status] )
SELECT TOP 100000 orders.orderid, DATEADD( day, 2, orderStatus.statusdate ), 'Received'
FROM orders INNER JOIN orderStatus ON orders.orderid = orderStatus.orderID
AND orderStatus.status = 'Shipped'
ORDER BY NEWID()

When those inserts complete, we have a good set of data simulating this design: many orders, with a variety of status values over a time span.

One key bit of context here is that we most often want to know the current status of an order or set of orders and, while the past values are important, they don’t make up the majority of queries against this data. Many queries are concerned with the current status of orders, or with finding orders that have a particular value as their current status. So the type of thing we typically want has this sort of structure:

/* Typical query to find the latest order status date */
SELECT MAX(statusdate) FROM orderStatus GROUP BY orderid


/* or, more specifically: */
SELECT o.orderID,
o.customerID,
o.orderDate,
o.description,
os.statusDate,
os.status
FROM orders o
INNER JOIN (
SELECT orderid, MAX(statusdate) maxdate FROM orderStatus GROUP BY orderid
) lastStatusDates ON o.orderID = lastStatusDates.orderid
INNER JOIN orderStatus os
ON o.orderID = os.orderID AND lastStatusDates.maxdate = os.statusDate

Notice the aggregation required. We have to go into the table and find the MAX(statusDate) for each order, then fetch the corresponding status value. That means two trips to the table and two joins for many queries. It works, but it’s not the best performer, and on a busy system all that additional aggregation and join traffic can slow things down.
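As an aside, on SQL Server 2005 and later the same “latest row per order” result can also be written with a window function. This sketch is for comparison only; it eliminates the second join back to orderStatus, though the engine still has to scan and sort every order’s rows to number them:

```sql
/* Alternative "current status" query using ROW_NUMBER (SQL Server 2005+);
   rn = 1 marks the most recent status row for each order. */
SELECT o.orderID, o.customerID, o.orderDate, o.description,
       os.statusDate, os.status
FROM orders o
INNER JOIN (
    SELECT orderID, statusDate, status,
           ROW_NUMBER() OVER ( PARTITION BY orderID
                               ORDER BY statusDate DESC ) AS rn
    FROM orderStatus
) os ON o.orderID = os.orderID
WHERE os.rn = 1
```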

It’s possible to get much better select performance if we partition out the current status rows (they are what we care about most of the time) from the history rows, using two tables and a simple trigger. In place of the single orderStatus table, I’ll create a pair of tables, orderStatusCurrent and orderStatusHistory. This will give us a very basic form of “poor man’s partitioning,” sans the complexity and cost of the Enterprise feature by that name.

CREATE TABLE orderStatusCurrent (
orderid int NOT NULL,
statusDate datetime NOT NULL,
status nvarchar(50) NOT NULL
)
GO

ALTER TABLE orderStatusCurrent
ADD CONSTRAINT PK_orderStatusCurrent PRIMARY KEY (orderid)
GO

ALTER TABLE orderStatusCurrent
ADD CONSTRAINT FK_orderStatusCurrent_orders
FOREIGN KEY (orderid) REFERENCES orders (orderID)
GO

ALTER TABLE orderStatusCurrent
ADD CONSTRAINT FK_orderStatusCurrent_statuses
FOREIGN KEY (status) REFERENCES statuses (status)
GO

CREATE TABLE orderStatusHistory (
orderID int NOT NULL,
statusDate datetime NOT NULL,
status nvarchar(50) NOT NULL
)
GO

ALTER TABLE orderStatusHistory
ADD CONSTRAINT PK_orderStatusHistory PRIMARY KEY (orderID, statusDate)
GO

ALTER TABLE orderStatusHistory
ADD CONSTRAINT FK_orderStatusHistory_orders
FOREIGN KEY (orderID) REFERENCES orders (orderID)
GO

ALTER TABLE orderStatusHistory
ADD CONSTRAINT FK_orderStatusHistory_statuses
FOREIGN KEY (status) REFERENCES statuses (status)

GO



To avoid changing the application code that creates orderStatus rows, it’s possible to mimic the insert behavior of the original design by adding a trigger to the orderStatusCurrent table. The trigger will, upon insert, move any existing rows for the same orders to the orderStatusHistory table, and then insert the new rows. In this way, the application can continue with the simple model of inserting status rows as needed, but the database engine will automatically separate the latest rows from the older ones.

The trigger code looks like this:

CREATE TRIGGER moveOrderStatusToHistory ON orderStatusCurrent
INSTEAD OF INSERT AS

INSERT INTO orderStatusHistory ( orderid, statusDate, status )
SELECT existing.orderid, existing.statusDate, existing.status
FROM orderStatusCurrent existing
INNER JOIN inserted ON existing.orderid = inserted.orderid

DELETE FROM orderStatusCurrent
FROM orderStatusCurrent existing
INNER JOIN inserted ON existing.orderid = inserted.orderid

INSERT INTO orderStatusCurrent ( orderid, statusDate, status )
SELECT orderid, statusdate, [status] FROM inserted
GO

A side note on triggers: whenever writing trigger code, make absolutely certain the code is set-based and can handle multiple rows, challenging though that can be. It’s a mistake (and I have seen it in commercial, production code) to assume that a table will only be manipulated one row at a time and write the trigger on that assumption. Triggers must correctly handle sets of rows.

The trigger has three steps: first, it copies any rows already in orderStatusCurrent for the same orders as the incoming rows into orderStatusHistory; second, it deletes those rows from orderStatusCurrent; finally, it inserts the new, incoming status rows. In effect the new rows replace the old ones, and the old ones get “archived.”

To test the trigger, we can run the same sequence of inserts against the new orderStatusCurrent table, and watch to make sure that a) exactly one row per order remains in the table, and b) the older rows are getting moved to orderStatusHistory correctly:

/* Test against one order */
SELECT * FROM orderStatusCurrent

INSERT INTO orderStatusCurrent ( orderid, statusDate, status )
VALUES ( 1000, GETDATE(), 'Fulfillment' )

SELECT * FROM orderStatusCurrent

INSERT INTO orderStatusCurrent ( orderid, statusDate, status )
VALUES ( 1000, GETDATE(), 'Stocking' )

SELECT * FROM orderStatusCurrent
SELECT * FROM orderStatusHistory

INSERT INTO orderStatusCurrent ( orderid, statusDate, status )
VALUES ( 1000, GETDATE(), 'Packaging' )

SELECT * FROM orderStatusCurrent
SELECT * FROM orderStatusHistory

INSERT INTO orderStatusCurrent ( orderid, statusDate, status )
VALUES ( 1000, GETDATE(), 'Shipping' )

SELECT * FROM orderStatusCurrent
SELECT * FROM orderStatusHistory

DELETE FROM orderStatusCurrent
DELETE FROM orderStatusHistory
GO

If that works as expected, we can run insert code similar to the first example, to populate the status records for all orders with test data:

INSERT INTO orderStatusCurrent ( orderid, statusdate, [status] )
SELECT orderid, orderdate, 'Fulfillment' FROM orders

INSERT INTO orderStatusCurrent ( orderid, statusdate, [status] )
SELECT TOP 800000 orderid, DATEADD( day, 5, orderdate ), 'Stocking'
FROM orders ORDER BY NEWID()

INSERT INTO orderStatusCurrent ( orderid, statusdate, [status] )
SELECT TOP 600000 orders.orderid, DATEADD( day, 5, orderStatusCurrent.statusdate ), 'Packaging'
FROM orders INNER JOIN orderStatusCurrent ON orders.orderid = orderStatusCurrent.orderID
AND orderStatusCurrent.status = 'Stocking'
ORDER BY NEWID()

INSERT INTO orderStatusCurrent ( orderid, statusdate, [status] )
SELECT TOP 400000 orders.orderid, DATEADD( day, 2, orderStatusCurrent.statusdate ), 'Shipping'
FROM orders INNER JOIN orderStatusCurrent ON orders.orderid = orderStatusCurrent.orderID
AND orderStatusCurrent.status = 'Packaging'
ORDER BY NEWID()

INSERT INTO orderStatusCurrent ( orderid, statusdate, [status] )
SELECT TOP 200000 orders.orderid, DATEADD( day, 2, orderStatusCurrent.statusdate ), 'Shipped'
FROM orders INNER JOIN orderStatusCurrent ON orders.orderid = orderStatusCurrent.orderID
AND orderStatusCurrent.status = 'Shipping'
ORDER BY NEWID()

INSERT INTO orderStatusCurrent ( orderid, statusdate, [status] )
SELECT TOP 100000 orders.orderid, DATEADD( day, 2, orderStatusCurrent.statusdate ), 'Received'
FROM orders INNER JOIN orderStatusCurrent ON orders.orderid = orderStatusCurrent.orderID
AND orderStatusCurrent.status = 'Shipped'
ORDER BY NEWID()

After these inserts, the current status value for each order will be in the target table, and the trigger will have moved the previous status values to the history table. Now we can run some comparisons to see what sort of performance improvement this yields.

First, a comparison of a large query reporting the current status of each order:

/* Select using the separate history table */
SELECT o.orderID,
o.customerID,
o.orderDate,
o.description,
os.statusDate,
os.status
FROM orders o
INNER JOIN orderStatusCurrent os ON o.orderID = os.orderID
OPTION( MAXDOP 1 )

/* Select using the original single status table */
SELECT o.orderID,
o.customerID,
o.orderDate,
o.description,
os.statusDate,
os.status
FROM orders o
INNER JOIN (
SELECT orderid, MAX(statusdate) maxdate FROM orderStatus GROUP BY orderid
) lastStatusDates ON o.orderID = lastStatusDates.orderid
INNER JOIN orderStatus os
ON o.orderID = os.orderID AND lastStatusDates.maxdate = os.statusDate
OPTION( MAXDOP 1 )

Note: I am using MAXDOP 1 to keep these test cases single-threaded, for a simpler apples-to-apples comparison. Analyzing them with parallelism is outside the scope of this exercise.
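For anyone reproducing these numbers: the costs quoted below are the optimizer’s estimates. One way to capture them without executing the queries (SSMS’s graphical estimated plan works equally well) is SHOWPLAN_ALL:

```sql
/* Capture estimated plans as text; the TotalSubtreeCost column of each
   plan's top row is the figure compared in this article. */
SET SHOWPLAN_ALL ON
GO
-- run the queries to compare here; they are compiled but not executed
GO
SET SHOWPLAN_ALL OFF
GO
```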

The optimizer reports that the first query, with the separate history table, has an estimated cost of about 15; the original design shows a cost of 33.4, so we are getting roughly twice the performance by splitting the table.

Another sample query involves fetching all orders having a specific status:

SELECT o.orderID,
o.customerID,
o.orderDate,
o.description,
os.statusDate,
os.status
FROM orders o
INNER JOIN orderStatusCurrent os ON o.orderID = os.orderID
WHERE os.status = 'Packaging'
OPTION( MAXDOP 1 )

SELECT o.orderID,
o.customerID,
o.orderDate,
o.description,
os.statusDate,
os.status
FROM orders o
INNER JOIN (
SELECT orderid, MAX(statusdate) maxdate FROM orderStatus GROUP BY orderid
) lastStatusDates ON o.orderID = lastStatusDates.orderid
INNER JOIN orderStatus os
ON o.orderID = os.orderID AND lastStatusDates.maxdate = os.statusDate
WHERE os.status = 'Packaging'
OPTION( MAXDOP 1 )

Here the difference is even greater: the split status tables give a query plan cost of 14, compared to 53 for the original design. Finally, orders for a specific customer:

SELECT o.orderID,
o.customerID,
o.orderDate,
o.description,
os.statusDate,
os.status
FROM orders o
INNER JOIN orderStatusCurrent os ON o.orderID = os.orderID
WHERE o.customerID = 12345
OPTION( MAXDOP 1 )

SELECT o.orderID,
o.customerID,
o.orderDate,
o.description,
os.statusDate,
os.status
FROM orders o
INNER JOIN (
SELECT orderid, MAX(statusdate) maxdate FROM orderStatus GROUP BY orderid
) lastStatusDates ON o.orderID = lastStatusDates.orderid
INNER JOIN orderStatus os
ON o.orderID = os.orderID AND lastStatusDates.maxdate = os.statusDate
WHERE o.customerID = 12345
OPTION( MAXDOP 1 )

The cost for the original design is 0.096; with the status table split, that improves to 0.063.

The one case that becomes slightly tricky is gathering all the status values for a particular order. In the original design, this is simply

SELECT o.orderID,
o.customerID,
o.orderDate,
o.description,
os.statusDate,
os.status
FROM orders o
INNER JOIN orderStatus os ON o.orderID = os.orderID
WHERE o.orderID = 12345

But with the new design we are forced to do some sort of union to get both the current and past values, like these two examples:

SELECT o.orderID,
o.customerID,
o.orderDate,
o.description,
st.statusDate,
st.status
FROM orders o
INNER JOIN (
SELECT * FROM orderStatusCurrent
UNION ALL
SELECT * FROM orderStatusHistory
) st ON o.orderID = st.orderID
WHERE o.orderID = 12345

SELECT o.orderID,
o.customerID,
o.orderDate,
o.description,
os.statusDate,
os.status
FROM orders o
INNER JOIN orderStatusCurrent os ON o.orderID = os.orderID
WHERE o.orderID = 12345
UNION ALL
SELECT o.orderID,
o.customerID,
o.orderDate,
o.description,
os.statusDate,
os.status
FROM orders o
INNER JOIN orderStatusHistory os ON o.orderID = os.orderID
WHERE o.orderID = 12345

Here, unsurprisingly, the original design costs about 30% less; but (and this did surprise me) of the two new examples, the first is about 25% faster than the second. The optimizer does a very good job of refactoring the derived table containing the union.
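If full-history queries are common enough to matter, one way to keep them simple (my addition, not part of the original design) is to hide the union behind a view, so those queries keep the shape they had under the single-table design:

```sql
/* Convenience view presenting current and historical status rows
   together; full-history queries can join to this instead of writing
   the UNION ALL themselves. */
CREATE VIEW orderStatusAll AS
SELECT orderID, statusDate, status FROM orderStatusCurrent
UNION ALL
SELECT orderID, statusDate, status FROM orderStatusHistory
GO
```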

There is some cost for this improvement in select performance, as one would expect. Instead of a single insert, a status change for an order now requires two inserts and a delete, moving the old record from one table to the other. So if insert speed is the only concern, this change might not be advisable; however, on many systems (including the one I was working with) the select/reporting traffic outweighs the transactional load, often many times over, so the change would probably help overall.

What I took away from this study was this: for scalability, it’s obviously unwise to mix rows of different kinds in the same table, but that rule can apply even when the difference is as subtle as current vs. past values that otherwise share exactly the same structure.

