In the different projects where I perform a performance audit or am involved in performance tuning, I often find that while queries and stored procedures function correctly, they are not efficient in processing larger data sets. Many OLAP databases are set up like OLTP databases, handling single records instead of record sets as a whole.
Tuning a query with proper WHERE clauses, indexes, and so on can often achieve a performance gain. However, when tuning the complete process and handling record sets as a whole, the performance gain can be many times greater.
Optimize for Changes or Selections
Tuning is the process of achieving optimal speed when working through the mutations (inserts, updates, and deletes) using the least amount of resources possible, while still keeping optimal performance for selections of data. Adding an index to a table will speed up selections, but slow down any mutations in the table. Defining the optimal balance between the two depends not only on the queries being executed on the database, but also on their frequency and priority.
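To make the trade-off concrete, here is a minimal sketch; the tbl_Orders table, its columns, and the index name are assumptions chosen purely for illustration.

-- Hypothetical example table (names are assumptions, not taken from a real system)
CREATE TABLE dbo.tbl_Orders
(
    OrderID INT IDENTITY(1,1) PRIMARY KEY,
    CustomerID INT NOT NULL,
    OrderDate DATETIME NOT NULL,
    Amount DECIMAL(10,2) NOT NULL
)

-- Speeds up selections that filter on CustomerID ...
CREATE NONCLUSTERED INDEX IX_tbl_Orders_CustomerID ON dbo.tbl_Orders (CustomerID)

-- ... but every mutation now has to maintain the index as well
INSERT INTO dbo.tbl_Orders (CustomerID, OrderDate, Amount) VALUES (1, GETDATE(), 99.95)

-- ... while this selection benefits from it
SELECT OrderID, OrderDate, Amount FROM dbo.tbl_Orders WHERE CustomerID = 1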
If, for instance, all mutations are handled in off-hours, tuning is often best when the mutations can be completed within those hours, while maximum performance is set up for the selections during office hours. During those hours, people are waiting for the results of a selection.
Use of Resources
Performance is determined by the limitations of the available resources. The specific hardware available, the type of resource needed to perform the requested query, and the concurrent use of the server and its resources determine the time needed. Often one of the resources accounts for the major part of the query cost. For on-the-fly calculations, the processor is the key factor; as the amount of data increases, memory and disk I/O become a large influence. When tuning, the biggest gain can be reached by addressing these resources first. The query execution plan gives insight into the use of the resources.
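Besides the graphical execution plan in SQL Server Management Studio, the I/O and timing statistics show where a query spends its resources. A minimal sketch, reusing the hypothetical tbl_Orders table from the sketch above:

-- Report logical/physical reads and CPU/elapsed time on the Messages tab
SET STATISTICS IO ON
SET STATISTICS TIME ON

SELECT OrderID, OrderDate, Amount
FROM dbo.tbl_Orders           -- hypothetical table from the earlier sketch
WHERE CustomerID = 1

SET STATISTICS IO OFF
SET STATISTICS TIME OFF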
Query Tuning
In query tuning, the main focus is to execute the existing queries with the best performance possible. By creating indexes, retrieving only the necessary columns and rows with correct WHERE clauses, using indexed views, using pre-calculated values, spreading tables over multiple disks, and so on, a given query's speed can be increased tremendously. However, there is a limit to how far this can go. After that, extra resources like more memory or faster disks can be added.
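As an example of one of these techniques, here is a minimal sketch of an indexed view with a pre-calculated aggregate. It again uses the hypothetical tbl_Orders table, and the view and index names are assumptions.

-- The view must be schema-bound and, because of the GROUP BY, must include COUNT_BIG(*)
CREATE VIEW dbo.vw_OrderTotals
WITH SCHEMABINDING
AS
SELECT CustomerID,
       COUNT_BIG(*) AS OrderCount,
       SUM(Amount)  AS TotalAmount
FROM dbo.tbl_Orders
GROUP BY CustomerID
GO

-- The unique clustered index materializes the view: selections read the pre-calculated
-- totals, while every mutation on tbl_Orders now also maintains the view
CREATE UNIQUE CLUSTERED INDEX IX_vw_OrderTotals ON dbo.vw_OrderTotals (CustomerID)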
Process Tuning
By altering the complete process of handling the dataflow, the target is to use as few connections as possible and to limit the number and complexity of query executions. This may require the data model to be altered. Because the process is always very specific to the given situation and often influences many aspects of the database, there are no general guidelines to lead you through this tuning. However, by identifying the area where the largest amount of time or resources is spent handling the data, a critical look at the current situation can lead to new methods. Most of the time, there is more than one way of handling the dataflow.
Below are some illustrative examples.
Connections and Executions
Making a connection to a database takes time. Executing even a simple query takes time to compile and execute. This overhead is partly dependent on the table and its content, but it is always there. For instance, the following code creates a table with one field. Without inserting any data, I query the table repeatedly and note the time:
-- Create an empty one-column table; no rows are ever inserted
CREATE TABLE tbl_Test (TestID CHAR(1))

DECLARE @Time1 DATETIME, @Time2 DATETIME, @i INT
SELECT @Time1 = GETDATE(), @i = 0

AGAIN:
-- First empty selection
SELECT * FROM tbl_Test
SELECT @Time2 = GETDATE(), @i = @i + 1
PRINT 'TIME IS ' + CONVERT(VARCHAR, @Time2, 14) + ', i = ' + CONVERT(VARCHAR, @i) + ', TIMEDIFF = ' + CONVERT(VARCHAR, DATEDIFF(ms, @Time1, @Time2))

-- Second empty selection
SELECT * FROM tbl_Test
SELECT @Time1 = GETDATE(), @i = @i + 1
PRINT 'TIME IS ' + CONVERT(VARCHAR, @Time1, 14) + ', i = ' + CONVERT(VARCHAR, @i) + ', TIMEDIFF = ' + CONVERT(VARCHAR, DATEDIFF(ms, @Time2, @Time1))

-- Repeat until 1,000 selections have been executed
IF @i < 1000 GOTO AGAIN
This produces 1,000 empty selections. The messages show the time difference between the previous selection and the current one. The first 160 or so selections are executed within the same millisecond. However, even with this trivial selection, after about 160 executions there is measurable overhead:
TIME IS 14:54:29:430, i = 158, TIMEDIFF = 0
(0 row(s) affected)
TIME IS 14:54:29:430, i = 159, TIMEDIFF = 0
(0 row(s) affected)
TIME IS 14:54:29:430, i = 160, TIMEDIFF = 16
(0 row(s) affected)
TIME IS 14:54:29:447, i = 161, TIMEDIFF = 0
(0 row(s) affected)
TIME IS 14:54:29:447, i = 162, TIMEDIFF = 0
As the table definitions grow more complex and the number of records in them increases, this overhead appears sooner and costs more time.
Record Set Size
The selection speed of different record sets does not scale linearly with the number of rows. Because a number of fixed steps have to be taken for any selection, retrieving extra records from the database often adds hardly any time. In a typical database, I have about 17 million records in a table. By making selections of 20,000, 50,000, 100,000, and 150,000 records, I measured the throughput in rows per second. These are some of the results:
Rows | Rows / Second |
---|---|
20,000 | 476 |
51,987 | 456 |
20,000 | 377 |
51,987 | 702 |
50,000 | 704 |
133,276 | 1,293 |
50,000 | 694 |
133,276 | 1,211 |
100,000 | 1,369 |
282,818 | 2,643 |
100,000 | 1,388 |
282,818 | 2,525 |
150,000 | 2,027 |
421,581 | 3,798 |
150,000 | 2,027 |
421,581 | 3,603 |
20,000 | 408 |
51,987 | 577 |
20,000 | 400 |
51,987 | 742 |
50,000 | 735 |
133,276 | 1,402 |
50,000 | 735 |
133,276 | 1,373 |
100,000 | 1,449 |
282,818 | 2,525 |
100,000 | 1,470 |
282,818 | 2,459 |
150,000 | 2,173 |
421,581 | 4,093 |
150,000 | 2,142 |
421,581 | 4,053 |
This test indicates that one selection of 100,000 records is about three times as fast as four separate selections of 20,000 records each. So, if possible, get all the information you need in one selection instead of going back to the database many times.
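A minimal sketch of the difference, using a hypothetical tbl_Customer table: the loop pays the per-query overhead for every row, while the set-based selection pays it only once.

-- Row-by-row: one query execution per customer (table and columns are hypothetical)
DECLARE @CustomerID INT
SET @CustomerID = 1
WHILE @CustomerID <= 4
BEGIN
    SELECT CustomerID, Name, City
    FROM dbo.tbl_Customer
    WHERE CustomerID = @CustomerID
    SET @CustomerID = @CustomerID + 1
END

-- Set-based: one selection returns the same information in a single result set
SELECT CustomerID, Name, City
FROM dbo.tbl_Customer
WHERE CustomerID BETWEEN 1 AND 4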
Cursor
A well-designed relational OLAP database receives all its mutations via source files. In a file, the complete record for an entity is given on a single line. The definition of the records is as follows:
- The first 50 characters identify the file and its origin, with fixed field lengths.
- After this, one or more categories are listed. These correspond to one or more tables in our database. The first two characters identify the category, and the next three characters give the length of the category content. After this content, the next category starts.
- Within a category, one or more elements are listed. These correspond to fields in the database. The first four characters identify the element, and the next three characters give the length of its content.
Because the number of categories as well as the number of elements varies, and they all have to be linked to a single entity, the chosen method was to parse the file in a .NET application, splitting each record into the relational model. From there, an INSERT with VALUES was issued for each table for every record. Loading files of a million records (as was common), with an average of nine tables per record, led to nine million connections / executions against the database.
I set up a file import table into which the complete record was bulk loaded. I added an identity field to uniquely identify a record / entity. From there, I parsed the identifying fields of the file, wrote them to a separate table, and removed this part from the record body.
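A minimal sketch of that bulk load; the file path is just an example, and the import table is reduced to the RecordBody column used in the code below.

-- Staging table for the raw records; the rest of the code only relies on RecordBody and RecordID
CREATE TABLE dbo.tbl_RecordImport (RecordBody VARCHAR(MAX) NOT NULL)

-- Bulk load the source file, one record per line (path is hypothetical, default terminators assumed)
BULK INSERT dbo.tbl_RecordImport
FROM 'C:\Import\Records.txt'

-- Add an identity column so every imported record / entity gets a unique ID
ALTER TABLE dbo.tbl_RecordImport ADD RecordID INT IDENTITY(1,1) NOT NULL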
Next, I inserted the first category of each record into a separate table, removed that category from the body, and repeated the step until all categories were split off. I then repeated the same approach for each element in a category.
I used this code for it:
-- CAPTURE RECORD INFO
INSERT INTO dbo.tbl_RecordInfo (RecordID, IDField, Status, ...)
SELECT RecordID, SUBSTRING (RecordBody, 1, 10), SUBSTRING (RecordBody, 11, 1), ....
FROM dbo.tbl_RecordImport
--REMOVE RECORD INFO FROM BODY
UPDATE dbo.tbl_RecordImport
SET RecordBody = SUBSTRING(RecordBody, 51, LEN(RecordBody) - 50)
DECLARE @CategoryCount INT, @Rows INT
SET @CategoryCount = 0
CATEGORY_LOOP:
SET @CategoryCount = @CategoryCount + 1
--INSERT CATEGORY
INSERT INTO dbo.tbl_RecordCategory (RecordID, SortOrder, CatNumber, CatBody)
SELECT RecordID, @CategoryCount, SUBSTRING (RecordBody, 1, 2), SUBSTRING (RecordBody, 6, CONVERT(INT, SUBSTRING(RecordBody, 3, 3)))
FROM dbo.tbl_RecordImport
--REMOVE ALL RECORDS OF WHICH ALL CATEGORIES ARE HANDLED
DELETE FROM dbo.tbl_RecordImport
WHERE (LEN(RecordBody)= CONVERT(INT, SUBSTRING(RecordBody, 3, 3)) + 10)
--REMOVE HANDLED CATEGORY FROM BODY
UPDATE dbo.tbl_RecordImport
SET RecordBody = SUBSTRING(RecordBody, CONVERT(INT, SUBSTRING(RecordBody, 3, 3)) + 6, LEN(RecordBody) - (CONVERT(INT, SUBSTRING(RecordBody, 3, 3)) + 5))
SET @Rows = @@ROWCOUNT
--IF NOT ALL CATEGORIES ARE HANDLED, LOOP AGAIN
IF @Rows > 0 GOTO CATEGORY_LOOP
ELEMENT_LOOP:
--INSERT ELEMENT
INSERT INTO dbo.tbl_Element (CatID, ElementNumber, ElementContent)
SELECT CatID, SUBSTRING (CatBody, 1, 4), SUBSTRING (CatBody, 8, CONVERT(INT, SUBSTRING (CatBody, 5, 3)))
FROM dbo.tbl_RecordCategory
WHERE CatBody IS NOT NULL
--REMOVE BODY IF ALL ELEMENTS ARE HANDLED
UPDATE dbo.tbl_RecordCategory
SET CatBody = NULL
WHERE LEN(CatBody) = CONVERT(INT, SUBSTRING(CatBody, 5, 3)) + 7
--REMOVE ELEMENT FROM BODY
UPDATE dbo.tbl_RecordCategory
SET CatBody = SUBSTRING(CatBody, CONVERT(INT, SUBSTRING(CatBody, 5, 3)) + 8, LEN(CatBody) - (CONVERT(INT, SUBSTRING(CatBody, 5, 3)) + 7))
WHERE CatBody IS NOT NULL
SET @Rows = @@ROWCOUNT
--IF NOT ALL ELEMENTS ARE HANDLED, LOOP AGAIN
IF @Rows > 0 GOTO ELEMENT_LOOP
This code splits the records into pieces that the database can read. In order to populate the tables in the database, the following view presents the results as a table-like record set:
SELECT
C.RecordID AS 'RecordID',
C.CatNumber AS 'Category',
C.SortOrder AS 'SortOrder',
MAX(CASE E.ElementNumber WHEN '0110' THEN E.ElementContent ELSE NULL END) AS 'E0110',
MAX(CASE E.ElementNumber WHEN '0120' THEN E.ElementContent ELSE NULL END) AS 'E0120',
MAX(CASE E.ElementNumber WHEN '0130' THEN E.ElementContent ELSE NULL END) AS 'E0130'
-- further element columns follow the same pattern
FROM dbo.tbl_RecordCategory C
LEFT JOIN dbo.tbl_Element E ON C.CatID = E.CatID
WHERE C.CatNumber = '01'
GROUP BY C.CatNumber, C.SortOrder, C.RecordID
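If the SELECT above is saved as a view, populating the destination table for this category becomes a single set-based statement. The names dbo.vw_Category01 and dbo.tbl_Category01 below are assumptions used only for illustration.

-- Load category '01' into its destination table in one set-based statement
-- (view and table names are assumptions)
INSERT INTO dbo.tbl_Category01 (RecordID, SortOrder, E0110, E0120, E0130)
SELECT RecordID, SortOrder, E0110, E0120, E0130
FROM dbo.vw_Category01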
The whole process loops through all categories and each element in them. This is about 20 loops each, hence about 40 database executions. A file of 100,000 records is completely handled in the database in about two minutes. The alternative, handling each record separately, takes close to an hour. By changing the process, performance is roughly 25 times better.
Conclusion
As illustrated, a great performance boost can be achieved by altering the data process. There is often more than one way to insert or manipulate the data in an OLAP database. By trying more than one method, you gain insight into the options, flexibility, and processing speed. When working with large data sets, the objective is to handle as many records as possible in one set. This can result in a significant performance gain.
About the Author
Nils Bevaart has been a DBA for over seven years and has hands-on experience in server management, database design, data warehousing, and performance tuning. In 2006, he set up a company specializing in data processing, covering effective maintenance, dataflow, database architecture, consultancy, and training.