SELECT col1, col2, col3 FROM aLargeTable WHERE flag = 1
On the surface, this seems like an easy thing to solve. Looking closer, though, it's likely that our offending "flag" column is a BIT column holding only two values, or, what amounts to the same thing, a column of another type containing only a few distinct values. That means that indexing it in the usual fashion will not work: the index will not be selective enough to give the database engine an advantage when selecting the rows, and the optimizer is not likely to even use it. This, incidentally, would be correct behavior, because using the index to look up so many individual rows would be slower than simply scanning the whole table.
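Before reaching for an index at all, it's worth confirming just how unselective the column is. A quick distribution check along these lines (using the hypothetical table and column names from the query above) tells the story:
-- aLargeTable and flag are the hypothetical names from the query above.
-- A few distinct values with millions of rows each means low selectivity.
select flag, count(*) as rowsPerValue
from aLargeTable
group by flag
order by flag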
So, what to do? There are a few possible actions here, but some will help and some will not. What follows is an analysis of some techniques, with their performance impact, using SQL Server 2005.
To simulate this situation, we need an example with the following characteristics.
- A large number of rows.
- Some columns that have low or extremely low selectivity, such as BIT columns or CHAR columns with only a few different values.
- The need to select from the table using the low selectivity columns as selection criteria.
I'm going to use a hypothetical mailing list table to emulate this situation. The mailing list consists of people's names, states, and two bit-column "flags" indicating whether their address is valid, and whether they have "opted in." (After all, all mailing lists allow people to opt in or out, right? In a perfect world, at least, they would.) The query we want to optimize is one that chooses every row having both of these flag values set to 1/True.
Create Sample Data
Here is the sample table:
CREATE TABLE baseTable (
aKey int identity( 10000000,1 ) PRIMARY KEY,
fname VARCHAR(50),
lname VARCHAR(100),
validAddr BIT,
getsMail BIT,
state CHAR(2)
)
I'll populate the table with a great many rows of sample data, using random selections from a cross join, broken into manageable chunks by a loop. On my test system, I have two tables of sample first and last names, obtained from the U.S. Census Bureau Web site — handy for producing legible test data — that you'll see in the select statements below. There is also a table of U.S. states. If you try something like this, be sure you have enough transaction log and disk space, and mind the overall size of the cross join to keep it reasonable:
declare @i int
set @i = 0
while @i < 10 -- Will yield 10 * 500,000 rows
begin
insert into basetable
select top 500000 f.[name], l.[name], va.val, gm.val, s.code
from
( select top 500 [name] from sampDataFNames order by newid() ) as f,
( select top 500 [name] from sampDataLNames order by newid() ) as l,
( select 0 val union all select 1 val ) as va,
( select 0 val union all select 1 val ) as gm,
( select top 5 code from states order by newid() ) as s
order by newid()
set @i = @i + 1
end
This method will probably produce some duplicates, but in the context of what we are testing, it makes no difference. (The technique for random selection is derived from one posted on Pete Freitag's Web site and elsewhere.)
We should end up with a distribution where about 25% of the rows in the table have both "flags" set to 1. It's possible to verify that with a query like this:
select
    ( select count(*) from baseTable ) as Total,
    ( select count(*) from baseTable where validAddr = 1 ) * 100.0
        / ( select count(*) from baseTable ) as PercentValidAddr,
    ( select count(*) from baseTable where getsMail = 1 ) * 100.0
        / ( select count(*) from baseTable ) as PercentGetsMail,
    ( select count(*) from baseTable where getsMail = 1 and validAddr = 1 ) * 100.0
        / ( select count(*) from baseTable ) as PercentBoth
We have enough rows, at 5 million, to hit the 50% / 50% / 25% almost exactly, just through probability.
Measure Baseline Query Performance
The first step in optimizing this is to get a baseline for the query. For this, I use the Display Execution Plan option in Management Studio, along with STATISTICS IO:
set statistics io on
GO
select fname, lname from baseTable where validAddr = 1 and getsMail = 1
As expected, this results in a Clustered Index Scan, execution plan cost = 25.4, reads = 21561. The database is examining the entire table to fish out the 25% of rows that match our criteria. It feels like extra work is going on here that ought not to be required, but avoiding that extra work is tricky.
Indexing
So what about a conventional index? We suspect it will not help, because it would not be selective enough, but it's important to really test that assumption:
create nonclustered index IX_Flags on baseTable( validAddr, getsMail )
GO
select fname, lname from baseTable where validAddr = 1 and getsMail = 1
There's no advantage here. SQL Server still scans the table, at the same cost; as we suspected, the index is not selective enough to be of any use, and we are asking for too many rows for it to be practical to look them up using an index.
drop index baseTable.IX_Flags
Next, it may be possible to create a covering index instead, ordered so that all the data for the records we are interested in sits together in one range. The catch that makes this a less attractive option is size: because the index has to carry every column the query needs, it will be essentially as large as the table itself, so we have doubled, or nearly doubled, our storage requirement. Ordered correctly, such an index would group the records we care about, but it would still store all of the rows when we are actually interested in only 25% of them. Ideally, we would index only that subset of rows, but that is not possible with a conventional index.
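For illustration only, a covering index for this particular query might look something like the sketch below; I did not create it for these tests. With the flag columns leading, the matching rows are grouped together, but the included name columns are stored again for every row in the table, matching or not:
-- Hypothetical covering index, not used in the tests in this article.
-- INCLUDE makes the index cover the query, but it still stores every row.
create nonclustered index IX_FlagsCovering
on baseTable( validAddr, getsMail )
include( fname, lname )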
Indexed Views
The best solution would be to isolate the rows from the table that we want from those we do not — in a way, to "pre-select" the rows matching the criteria, before the query is even issued. What if we create a view containing the keys of only the rows we want, and index it, instead of the base table? This way we have a subset of the keys that match our criteria, persisted so it's always at the ready, which could be joined to the base table to produce the right set of results. It also seems that the view should be much smaller than a covering index, because it persists only some of the keys.
If you've never used an indexed view, it's an interesting construct: it is like a view, in that it contains a stored select statement that references data in base tables, but SQL Server will persist the data that the view returns, as if it were actually a table itself, instead of fetching the data from the base tables on demand. Any changes to the underlying data are automatically applied to the stored data composing the view. Indexed Views are often applied to "pre-resolve" complicated select statements that would otherwise take significant time to construct, especially those that are run frequently. See Books Online for more details, and be sure to read about the restrictions on indexed views related to connection settings and different editions of SQL Server.
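As a practical reminder on those restrictions: creating an indexed view, and modifying data underneath one, typically requires specific session settings. The list below is the usual set for SQL Server 2005, but treat it as a memory aid rather than the definitive reference; Books Online is the authority:
-- Session settings commonly required for indexed views (verify against Books Online)
set ansi_nulls on
set ansi_padding on
set ansi_warnings on
set arithabort on
set concat_null_yields_null on
set quoted_identifier on
set numeric_roundabort off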
create view keysOnly with schemabinding
as ( select akey
from dbo.baseTable
where validAddr = 1 and getsMail = 1 )
GO
create unique clustered index IX_keysOnly on keysOnly( aKey )
With this setup there is good news and bad news. The following query now has query plan cost = 2.87 and reads = 2019, a significant improvement over the baseline, because the matching keys are in effect already selected, persisted in the view. But it fetches only the keys:
select aKey from keysOnly with( noexpand )
The problem is that we need the data from the base table, not just the keys. The real information we need is something like this:
select fname, lname from basetable t where validAddr = 1 and getsMail = 1
In order to get that using the indexed view, we need to join the view to the base table, which, it turns out, is expensive:
select fname, lname
from keysOnly v with( noexpand )
inner join basetable t on v.akey = t.akey
We now see a query plan cost of 36. SQL Server has to scan both objects and merge-join them, and because locating the matching rows still requires scanning the base table, the join ends up more expensive than simply scanning the table in the first place. Any advantage we gain by persisting the correct keys is cancelled by the work needed to fetch the rest of the data from the base table.
From this result, it's clear that the indexed view needs to contain the results for the query and not just the keys. So, I'm going to create an indexed view that essentially duplicates all the required data from the base table, for rows that match the criteria I need to select. This is something like a covering index, except that it stores only the matching rows, not all the rows, from the base table. I'm also going to include the State column from the base table, for additional tests below. The price for this, obviously, is the additional storage needed to persist the view:
create view completeSubset with schemabinding
as ( select akey, fname, lname, state
from dbo.baseTable
where validAddr = 1 and getsMail = 1 )
GO
create unique clustered index IX_completeSubset on completeSubset( aKey )
GO
select fname, lname from completeSubset with( noexpand )
This select statement now has query plan cost = 4.9 and reads = 4843, which is about 1/5 the original cost. We've hit on something that will make a meaningful improvement. Adding other criteria to the query still shows similar benefit. Let's look for all the people in the sample data with state = Montana:
Selecting from the base table:
select fname, lname, state
from dbo.baseTable
where validAddr = 1 and getsMail = 1
and state = 'MT'
Query plan cost: 21.57
And from the indexed view:
select fname, lname, state
from completeSubset with( noexpand )
where state = 'MT'
Query plan cost: 5.05
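If a filter like the state criterion were common, it should also be possible to add a secondary nonclustered index on the view itself, on top of its unique clustered index. I have not measured this, but the syntax would be along these lines:
-- Hypothetical secondary index on the indexed view; not measured in these tests.
-- Nonclustered indexes can be added to a view once its unique clustered index exists.
create nonclustered index IX_completeSubset_state
on completeSubset( state )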
Impact on Updates and Inserts
This technique will have some cost in terms of changes to the base table, so it's important to estimate how much, and whether that impact will be detrimental. The sample data I created happened to include about 150,000 rows having state = "MT", so to at least get some idea of the update performance hit, I performed some updates on those rows and compared the execution plans. This is not a comprehensive test of inserts and updates, but just a brief check.
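For reference, that figure is just the result of a simple count; a different random sample will land on a somewhat different number:
-- About 150,000 rows in this particular sample
select count(*) from baseTable where state = 'MT'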
First, with the indexed view in place, a change to the base table also has to be reflected in the view's persisted data. For a statement like this one:
update baseTable set getsMail = 0 where state = 'MT'
An update has to be performed against both the base table and the view content, and each will show up in the execution plan as a node labeled as a "clustered index update." It might seem odd, but that is what is happening: the view data is persisted, and has its own clustered index, so that clustered index has to be updated along with the table itself. The cost in this case is 28.79.
Next, I'll remove the view, and perform a similar update:
drop view completeSubset
GO
update baseTable set getsMail = 1 where state = 'MT'
The cost to perform this update without the indexed view is 24.70, slightly less than the 28.79 we saw with the view present, as one would expect. But this difference does not seem like a significant impact compared to the performance gain we achieved on select queries. If the workload on this hypothetical system favors selects over changes (and many systems do, by a factor of ten or more), then the benefits are probably worth it. Let me roughly generalize the costs above into a guesstimate, understanding that this is an oversimplification: suppose the select statements run five times as often as the updates. Our test showed roughly a four-fold improvement in selection; multiplied by five, that is about a 20x weighted gain in selection performance, against roughly a 17% increase in update cost (28.79 versus 24.70).
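The arithmetic behind that guesstimate is nothing more than the ratio of the measured plan costs; spelled out with the numbers from above:
-- Ratios of the plan costs measured above (illustrative arithmetic only)
select
21.57 / 5.05  as selectImprovement,  -- roughly 4.3x cheaper selects with the view
28.79 / 24.70 as updatePenalty       -- roughly 1.17x more expensive updates with the view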
Conclusion
Conventional indexes are not much help when the data they index is not selective. However, if you have adequate storage space and data cache available, and the hit in insert/update performance is not an overriding concern, then it is possible to get significant gains in select performance on this type of data by denormalizing into separate indexed views. This type of denormalization is relatively safe, because — unlike denormalized data in tables — the database engine itself maintains the consistency of the data. So there is little or no worry that the data in the indexed views will get out of sync with the base tables, and there is no additional effort or coding required to keep the view data up to date.
Good candidates for this technique include tables with large numbers of rows, where it's common to select from those rows based on columns that have low selectivity, but where the result sets are expected to be smaller than the base table itself. For example, if the selected set comprises 80% of the base table, then the benefit of this technique is not likely to be great, but if the selected set is only 25% of the table, then it will be significant. There is a trade-off in terms of the storage space required, but the space requirement is mitigated by storing only the matching rows from the base table, rather than all of them, as with a covering index. There is also some price in insert/update performance, but on many systems, the demands made by select queries far outweigh those made by inserts and updates.