This historical view on the progression in processor performance over time is presented as a guide to upgrading older server systems and as an acquisition strategy for new systems. The focus is on the Intel IA-32 line because Intel designs processors to Moore's Law, and a common set of performance results is available for this processor line over an extended period. Some discussion is also given on Opteron and Itanium characteristics relevant to SQL Server performance.
It has been over six years since the introduction of the Pentium III processor on the 180nm process. Since that time, the Pentium III line was succeeded in desktop and server systems by the Pentium 4 line and in mobile systems by the Pentium M line. Both the Pentium 4 and Pentium M lines are now being succeeded by the Core 2 line which is mostly a descendant of the Pentium M with a few genes contributed from the Pentium 4. The 180nm process (or 0.18µm) has been succeeded by three process generations, 130nm, 90nm, and now 65nm.
The implication of Moore's Law is that for server systems with fast load growth, on the order of 40% year over year, a frequent replacement cycle of two years is more effective than a four-to-five-year replacement cycle. This runs counter to common accounting practices, which tend to assume that computer systems should have a five-year life cycle. So it is important for the DBA to present the case to management decision makers that technology practices should be driven by technology developments and not arbitrary accounting rules.
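The replacement-cycle argument can be illustrated with a small sketch. It assumes a steady 40% annual load growth and a doubling of new-system performance every two years. Both are simplified modeling assumptions, and the function names are hypothetical.

```python
# Assumptions (illustrative, not measurements): load grows 40% per year,
# and a newly purchased system is twice as fast every two years.
ANNUAL_LOAD_GROWTH = 1.40
DOUBLING_PERIOD_YEARS = 2.0

def load_multiple(years: float) -> float:
    """Load relative to today after the given number of years."""
    return ANNUAL_LOAD_GROWTH ** years

def new_system_multiple(years: float) -> float:
    """Performance of a system bought at that time, relative to one bought today."""
    return 2.0 ** (years / DOUBLING_PERIOD_YEARS)

for cycle in (2, 4, 5):
    print(f"{cycle}-year cycle: load x{load_multiple(cycle):.2f}, "
          f"new system x{new_system_multiple(cycle):.2f}")
```

Under these assumptions, a system kept for two years has to absorb roughly a 2x load increase, while one kept for four years has to be sized for nearly 4x at purchase time.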
Intel IA-32 Processors from Coppermine to Merom/Conroe/Woodcrest
Moore's Law originally stated that processor performance can be doubled every two years. This was adjusted to 18 months in the early years to account for progress in areas not originally considered. But the original rate is probably more accurate in recent history and the near future.
The basis for Moore's Law is as follows. Every two years, a new manufacturing process is available. From one process generation to the next, linear dimensions are reduced by a factor of 0.72, for an area reduction of approximately one-half (0.72 x 0.72). The goal of a process shrink is to increase transistor switching speed by 1.3x. With other layout improvements, it should be possible to increase the frequency of a processor architecture designed on the previous process by 50%, for a net performance gain of approximately 40%. It should also be possible to design a new microprocessor architecture on the new process with twice as many (logic) transistors as the original. The new processor architecture should be approximately 40% faster than the old architecture on the same manufacturing process, hence the original Moore's Law.
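The arithmetic behind this reasoning can be checked with a short sketch. The constants are the factors quoted above. The variable and function names are illustrative only.

```python
# Scaling factors quoted in the text for one process generation.
LINEAR_SHRINK = 0.72       # linear dimension scale factor per process generation
PROCESS_GAIN = 1.4         # approximate performance gain from a process shrink
ARCHITECTURE_GAIN = 1.4    # approximate gain from a new architecture (2x logic transistors)

def area_scale(linear_shrink: float = LINEAR_SHRINK) -> float:
    """Die area scales with the square of the linear dimension."""
    return linear_shrink ** 2

def combined_gain(process_gain: float = PROCESS_GAIN,
                  architecture_gain: float = ARCHITECTURE_GAIN) -> float:
    """Gain from one process shrink combined with one new architecture."""
    return process_gain * architecture_gain

print(f"area scale per generation: {area_scale():.3f}")   # roughly one-half
print(f"combined gain: {combined_gain():.2f}x")           # roughly a doubling
```

The two 40% contributions multiply to about 1.96x, which is where the doubling every two years comes from.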
To properly distinguish between architecture and manufacturing process generations for Intel processors, it is more convenient to use the code names, rather than the official names (see http://www.sandpile.org/ for translation). Official names are driven by marketing strategy where a single name can apply to more than one processor core. Table 1 shows the desktop and mobile cores by process, official product name, code name, top frequency, launch date, and cache size. Table 2 shows codenames for some Xeon, Xeon MP, and other processors.
The Pentium III processor on the 180nm process launched in November 1999 has the Coppermine core. The Pentium 4, with the Willamette core, was a new micro-architecture radically different from previous generations, also introduced on the 180nm process in November 2000. Both the Pentium III and Pentium 4 processors were continued to the 130nm process starting in mid-2001 for the Tualatin core and in January 2002 for Northwood. Both lines featured 256 KB on-die L2 cache at 180nm and 512 KB on 130nm. The Pentium III brand and architecture ended with the 130nm process.
Process | Processor | Core | Top Freq. | Launch | Notes |
180nm | Pentium III | Coppermine | 1.0 GHz | Nov. 1999 | 256 KB L2 |
180nm | Pentium 4 | Willamette | 2.0 GHz | Nov. 2000 | 256 KB L2 |
130nm | Pentium III | Tualatin | 1.4 GHz | July 2001 | 512 KB L2 |
130nm | Pentium 4 | Northwood | 3.4 GHz | Jan. 2002 | 512 KB L2 |
130nm | Pentium M | Banias | 1.7 GHz | March 2003 | 1 MB L2 |
130nm | Pentium 4 EE | Gallatin | 3.4 GHz | Feb. 2004 | 2 MB L3 |
90nm | Pentium 4 | Prescott | 3.8 GHz | Feb. 2004 | 1 MB L2 |
90nm | Pentium M | Dothan | 2.26 GHz | May 2004 | 2 MB L2 |
90nm | Pentium 4 | Irwindale | 3.8 GHz | Feb. 2005 | 2 MB L2 |
65nm | Pentium 4 | Cedar Mill | 3.8 GHz | Jan. 2006 | 2 MB L2 |
65nm | Core Duo | Yonah | 2.16 GHz | Jan. 2006 | 2 MB L2 |
65nm | Core 2 | Woodcrest | 3.0 GHz | June 2006 | 4 MB L2 |
Table 1: Intel processors with codenames, launch date and top frequency.
Process | Processor | Core | Notes |
180nm | Pentium III Xeon | Cascades | 2 MB L2 |
180nm | Xeon MP | Foster | 1 MB L3 |
130nm | Xeon MP | Gallatin | 2 MB/4 MB L3 |
90nm | Pentium D | Smithfield | 1 MB L2 Dual Core |
90nm | Xeon | Nocona | 1 MB L2 |
90nm | Pentium D | Paxville | 2 MB L2 |
90nm | Xeon MP | Cranford | 1 MB L2 |
90nm | Xeon MP | Potomac | 1 MB L2 + 8 MB L3 |
90nm | Xeon 7040 | Irwindale? | 2 MB L2 Dual Core |
65nm | Pentium D | Presler | 2 MB L2 Dual Core |
65nm | Xeon 50x0 | Dempsey | 2 MB L2 Dual Core |
65nm | Xeon MP | Tulsa | 2x1 MB L2 + shared 16 MB L3 |
Table 2: Additional codenames for derivative processors.
The Willamette/Northwood micro-architecture also ended with the 130nm process, but the Pentium 4 brand was continued with a new architecture derived from the original. The Prescott core launched in February 2004 on the 90nm process with a 1 MB L2 cache. A year later, this was followed by the 90nm Irwindale core, having the same architecture as Prescott with a bigger 2 MB L2 cache. The final process for the Prescott line is 65nm, with the Cedar Mill core for desktops, Dempsey for two socket servers, and Tulsa for four socket servers. Variants of the Prescott core featured two single core processor dies in one package, with brand names Pentium D or Pentium Extreme Edition. This offers essentially the same capability as one die with two independent processor cores with no shared resources or special inter-processor communication capabilities.
The first Pentium 4 architecture, Willamette core, implemented 20+ pipeline stages to achieve a much higher frequency than the Pentium III architecture (12+ pipeline stages) on the same process. The second Pentium 4 architecture with the Prescott core continued this strategy to 30+ pipeline stages presumably with the intent to further accelerate the pace of frequency increase.
It was realized the Willamette architecture did not address the needs of the mobile market, which needed the best performance to power ratio and good but not necessarily top performance. For this purpose the Pentium M line was developed starting with the Banias core (1 MB L2 cache) on 130nm in March 2003, continued to the 90nm Dothan core, 2 MB L2 in May 2004, and then to the 65nm dual-core Yonah in January 2006. With each process shrink, the processor core received minor enhancements. A major change with Yonah was the shared L2 cache. It is simpler to implement a dedicated L2 cache for each core. However, the theory is that a single shared 2 MB L2 cache is better than two separate 1 MB L2 caches, one for each core, if implemented properly.
Intel has just introduced the first of the Core 2 processors, with codenames Merom, Conroe, and Woodcrest for mobile, desktop, and server systems respectively, consolidating the Pentium 4/Pentium D and Pentium M product lines. Woodcrest launched first, and the published performance results are for Woodcrest. This codename will be used, but all three refer to the same processor core. (See David Kanter's article on http://www.realworldtech.com/ on core micro-architecture performance.)
SPEC CPU 2000 Integer Performance Comparisons
The SPEC CPU 2000 Integer suite (http://www.spec.org/) is a reasonable benchmark for server applications. It averages 12 individual applications that cover a broad range of code patterns, making it difficult to generate disproportionate performance gains using special features that do not have broad applicability. A database benchmark is more relevant for SQL Server based applications. For this, the TPC-C benchmark is appropriate, but there is not as broad a range of results available as for SPEC CPU. So it is helpful to examine results for both benchmarks, each of which exposes certain differences not evident in the other benchmark.
Figure 1 shows SPEC CPU 2000 Integer base performance for representative IA-32 processors from 2000 to 2006. The selected processors span three architecture and three process generation transitions. The architecture transitions are from Pentium III to the first Pentium 4 (Willamette) to the second Pentium 4 (Prescott) and finally to Core 2 (Woodcrest). The process transitions are from 180nm (Coppermine and Willamette) to 130nm (Tualatin and Northwood) to 90nm (Prescott/Irwindale) to 65nm (Cedar Mill and Woodcrest). The range constitutes three doublings of performance by Moore's Law. Each process generation contributes one-half and each architecture generation contributes one-half, for a target gain of 8x. By this metric, it might seem that Woodcrest is slightly underperforming the target of 3,600, which is 8x the Pentium III 1.0 GHz result.
Figure 1: SPEC CPU 2000 Integer performance for Intel IA-32 line from 2000-2006.
The primary explanation is that SPEC CPU 2000 Integer is a single threaded benchmark while Woodcrest relies on dual cores to meet the performance progression objective. TPC-C on the other hand is a multi-threaded benchmark and will properly show the benefit from multi-core processors. The Willamette to Northwood transition significantly exceeded the expected 1.4x gain, the details of which will be discussed later.
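The 8x target can be worked through as a quick sketch. The baseline of 450 is not quoted directly; it is only implied by dividing the 3,600 target by 8. The Woodcrest score is the 3.0 GHz result of 3,057 quoted later in this article.

```python
DOUBLINGS = 3                      # three Moore's Law doublings from 2000 to 2006
TARGET_GAIN = 2 ** DOUBLINGS       # 8x
TARGET_SCORE = 3600                # target quoted in the text
IMPLIED_BASELINE = TARGET_SCORE / TARGET_GAIN   # implied Pentium III 1.0 GHz score (assumption)
WOODCREST_SCORE = 3057             # Woodcrest 3.0 GHz SPEC CPU 2000 integer base result

actual_gain = WOODCREST_SCORE / IMPLIED_BASELINE
fraction_of_target = WOODCREST_SCORE / TARGET_SCORE

print(f"implied baseline: {IMPLIED_BASELINE:.0f}")
print(f"single-threaded gain: {actual_gain:.2f}x of a {TARGET_GAIN}x target")
print(f"fraction of target reached: {fraction_of_target:.0%}")
```

On these numbers a single Woodcrest core reaches roughly 85% of the three-doubling target, with the second core covering the remainder only on multi-threaded workloads.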
Figure 2 shows performance on the 12 individual applications that make up the SPEC CPU 2000 Integer benchmark (details on the SPEC Web site). Regardless of the overall composite score, individual applications can have widely varying results. Woodcrest shows strong gains in all areas. This is a very difficult achievement and a good indicator that a broad range of applications will realize performance gains with Woodcrest.
Figure 2: SPEC CPU 2000 Integer performance by application.
Figure 3 shows the performance gain for each of the SPEC CPU 2000 Integer component applications by processor relative to the 180nm Pentium III 1.0 GHz/256 KB cache. The relative gain for Woodcrest on mcf is 15.41, which is off the scale.
Figure 3: Performance gain by application relative to Pentium III 1.0 GHz/256 KB.
Figure 4 shows the SPEC CPU 2000 Integer 12 component applications for the Pentium III and Pentium 4 processors on 180nm. On the same compiler (Intel C/C++ 5.0), the Pentium 4 shows a 41% gain over the Pentium III at the top frequency for each processor on 180nm. On the newer Intel C/C++ 6.0 compiler, the Pentium 4 shows a 50% gain over Pentium III. In the early days, outside contributions were sufficient to warrant revising Moore's Law to 2x in 18 months, but with maturity, the original projection is probably more appropriate. The Pentium 4 processor requires a compiler aware of its internal architecture to take full advantage of its capabilities. Some applications not recompiled and tuned for the Pentium 4 did not show much gain.
Figure 4: SPEC CPU 2000 Integer by application for Pentium III and Pentium 4 on 180nm.
Figure 5 shows a comparison of Pentium III on 130nm against Pentium 4 on 180nm. The process shrink of the Pentium III from 180 to 130nm shows a comparable gain to the advance of one architecture generation on the same process (Pentium III to Pentium 4 on 180nm) on average. Each of the two foundations of Moore's Law makes approximately equal contribution when averaged over multiple applications, even though there are variations by application.
Figure 5: SPEC CPU 2000 Integer for Pentium III on 130nm and Pentium 4 on 180nm.
Figure 6 shows the SPEC CPU 2000 Integer base performance for a range of the 130nm Pentium 4 processors from 2.0 GHz to 3.4 GHz with reference to Willamette at 2.0 GHz.
Figure 6: SPEC CPU 2000 Integer for Pentium 4 on 180 and 130nm.
The shrink of the Pentium 4 from 180nm to the 130nm Northwood core netted nearly a 2x gain, far above the 1.4x normally expected. The gain derives from a combination of faster core speed, larger cache, improved compilers, and higher front-side bus bandwidth. Northwood launched in January 2002, quickly reaching 3.06 GHz in November 2002 with a SPEC CPU 2000 integer result of 1,099 on the Intel C/C++ version 6 compiler.
At this point in mid-2002, Intel would normally have introduced a new architecture on the 130nm process to carry on the performance progression, and discontinued attempts to further tweak the Northwood core. However, Intel's strategy of pursuing separate processor architecture lines for desktop and mobile platforms meant that a new processor design for desktops was scheduled for the 90nm process in the late 2003 to early 2004 time frame instead of the 130nm process in mid-2002. Until the next architecture was ready, Intel managed to tweak the Northwood core for two additional speed bins to 3.4 GHz through early 2004. The MP server derivative of Northwood, Gallatin with a 2 MB L3 cache in addition to the 512 KB L2, introduced as Pentium 4 Extreme Edition, achieved SPEC CPU 2000 integer base result 1,701.
The normal Intel schedule would have had the 90nm process ready in late 2003, preferably shrinking an existing 130nm design to better guarantee intercepting the process availability point. In fact, no 90nm processor was ready until early 2004. It is unclear whether this was because no design was ready or the extra time was used to resolve unexpected issues with the 90nm process. The first 90nm processor, Prescott, was a new architecture, also unusual for the Intel pattern.
In a process shrink, it is normally possible to reduce transistor power consumption. This allows both higher frequency operation and more transistors in a general power range. However, the 90nm process had higher leakage current than expected. The result was that the Prescott core only reached 3.8 GHz due to thermal limitations, even though transistor switching speed would have supported much higher frequency operation. Figure 7 shows Northwood (130nm 512 KB L2), Gallatin (130nm 2 MB L3), Prescott (90nm 1 MB L2) and Irwindale (90nm 2 MB L2) component performance.
Figure 7: SPEC CPU 2000 Integer for Pentium 4 on 130nm and 90nm.
The Prescott core achieved a SPEC CPU 2000 integer base result of 1,666 at 3.8 GHz, only 24% over Northwood at 3.4 GHz and slightly below Gallatin. The Irwindale core at 3.8 GHz with 2 MB L2 cache was able to reach 1,833 for a 36% gain over Northwood but only 8% over Gallatin. Prescott under-performed Northwood in gzip, and showed only minor gains in crafty.
Since Prescott encompassed both a new architecture and a process shrink, this was well short of the true goal of doubling Northwood performance. It is possible to estimate the design goals for the 90nm Prescott core had it not been thermally limited. A simple shrink of the Northwood core to 90nm is expected to yield a 30% frequency increase. A full compaction should yield a 50% gain. The increase in pipeline stages from 20 in Willamette/Northwood to 30 in Prescott was probably intended to increase frequency by 50% on the same process. So there is reason to expect that the true goal of Prescott was to nearly double Northwood frequency to the neighborhood of 6 GHz at 90nm and close in on 10 GHz at 65nm, had leakage current not been an issue.
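The frequency estimate above can be reproduced with a few multiplications. This is only a sketch: the choice to apply the simple-shrink factor at 90nm and the full-compaction factor at 65nm is an assumption made for illustration, not something the published material states.

```python
NORTHWOOD_TOP_GHZ = 3.4       # top Northwood frequency on 130nm
SIMPLE_SHRINK_GAIN = 1.3      # frequency gain quoted for a simple shrink
FULL_COMPACTION_GAIN = 1.5    # frequency gain quoted for a full compaction
DEEPER_PIPELINE_GAIN = 1.5    # gain attributed to going from 20 to 30 pipeline stages

# 90nm: shrink of Northwood plus the deeper Prescott pipeline (assumed combination).
prescott_90nm_ghz = NORTHWOOD_TOP_GHZ * SIMPLE_SHRINK_GAIN * DEEPER_PIPELINE_GAIN

# 65nm: one further process generation, assumed here to be a full compaction.
successor_65nm_ghz = prescott_90nm_ghz * FULL_COMPACTION_GAIN

print(f"estimated 90nm goal: {prescott_90nm_ghz:.2f} GHz")
print(f"estimated 65nm goal: {successor_65nm_ghz:.2f} GHz")
```

With these factors the estimate lands a little above 6 GHz at 90nm and just under 10 GHz at 65nm, in line with the figures given above.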
The Woodcrest core is derived mostly from the Pentium M, so it is helpful to review the Pentium M processors, Banias, Dothan, and Yonah (under the Core Duo brand). The Banias core has been described as a completely new design by some and as an improved Pentium III by others. Both Pentium II and Pentium III processors represent minor improvements to the Pentium Pro architecture, adding MMX instructions in Pentium II, SSE instructions in the Katmai core Pentium III and a significantly improved on-die L2 cache with Coppermine. There were no significant changes to the core architecture.
It is possible that Banias retained the core architecture of Pentium Pro but made significant design improvements in both performance and power efficiency. Intel documents describe performance improvements in Banias as advanced branch prediction, micro-ops fusion (decoded x86 instructions paired into single op), dedicated stack engine, and the 4x bus from Pentium 4. Dothan added: Enhanced Register Access Manager, Intelligent branch prediction – Advanced Tight Loop Execution, dual channel DDR2-533 compared with single DDR-333 for Banias. Yonah improvements: dual core shared L2 cache, SSE, integer division and the H/W pre-fetcher.
Figure 8 shows the component performance for the Pentium M 1.6 GHz with 2 MB cache on 90nm relative to Pentium III 1.4 GHz/512 KB on 130nm. Unfortunately there was not a result listed for the 130nm Pentium M, 1 MB L2 cache. Some of the performance gain is due to frequency (1.4 GHz to 1.6 GHz), cache size (512 KB to 2 MB), compiler (Intel C/C++ version 5.01 to version 9.0), memory subsystem (single SDRAM 133 MHz to dual DDR2 533 MHz) and the remainder from architectural differences between the Pentium III and Pentium M.
Figure 8: Pentium M 1.6 GHz/2 MB performance relative to Pentium III 1.4 GHz/512 KB.
The component applications vpr, gcc, mcf, bzip2, and twolf are highly sensitive to cache size, but it is clear that not all of the gains can be attributed to frequency, cache, compiler, and memory improvements. There are definitely substantial performance gains due to improvements in the design or architecture. Intel documents show a 65% gain for mcf from Banias to Dothan at the same frequency.
For some curious reason, there do not appear to be any public Intel documents detailing the number of pipeline stages in Banias. There is reason to believe it may have 12-14 pipeline stages, similar to the Pentium III. The top 130nm Banias frequency was 1.7 GHz. A full compaction of Coppermine to 130nm should have yielded 1.5 GHz. It is possible that Tualatin was either an assisted shrink or that top frequency was not a pressing goal. So 1.7 GHz for a new 130nm design with 12-14 pipeline stages is very reasonable. Note that the Banias 1.7 GHz operated at 1.484 V while Northwood required 1.525 V to reach 3.4 GHz. The design team called Pentium M a new design instead of an improved Pentium III. It is at the least a significant improvement over the Pentium III from the performance perspective, more than enough to constitute one generation.
The 90nm Dothan only reached 2.26 GHz, but this is probably limited by the power envelope for mobile platforms rather than the true limit of the processor core. Of the two 90nm processors, Dothan 2.26 GHz operated at 1.34 V while Prescott operated at 1.425 V. Dothan might have reached over 2.5 GHz if not restricted by the 27 W power envelope. The shrink of Dothan to 65nm might have reached as high as 3.7 GHz based on a 1.5x gain. The top actual Yonah frequency is 2.16 GHz, slightly lower than Dothan, probably to accommodate a 31 W power envelope for dual cores at 1.3 V, compared to 1.40 V for the Cedar Mill desktop processor.
On http://www.extremetech.com/, Conroe is described as a 14-stage pipeline. It is unclear whether this was inherited from the Banias/Dothan/Yonah line or is a new change. Conroe is four-wide, meaning four instructions can be issued on each clock and four can be retired on each clock. Other enhancements include macro-op fusion, which pairs certain x86 instructions into a single micro-op. Figure 9 compares the component performance of Pentium M 2.26 GHz/2 MB to Woodcrest 2.33 GHz/4 MB.
Figure 9: SPEC CPU 2000 Integer for Pentium M on 90nm and Core 2 on 65nm.
Figure 10 shows the performance of Woodcrest relative to Dothan at nearly the same frequency. There is a 3% increase in frequency, from 2.26 GHz to 2.33 GHz, and a 34% increase in performance. Some can be attributed to cache, memory bandwidth, and compiler, but the bulk of the gain is from architectural improvements. Considering that the processor core probably did not double the number of logic transistors, this is a very impressive performance gain.
Figure 10: Core 2 2.33 GHz performance relative to Pentium M 2.26 GHz.
Figure 11a compares the range of available Woodcrest processors (Xeon 51x0) to Dempsey (Xeon 5080) 3.73 GHz. Even the 1.86 GHz Woodcrest has better overall performance than the 3.73 GHz Dempsey core. Considering that Prescott performance was limited, the Woodcrest 3.0 GHz SPEC CPU 2000 integer base performance of 3,057 should not be unexpected.
Figure 11a: Woodcrest performance compared to Dempsey.
There is not normally such a large performance jump from the top frequency of one processor line to the top launch frequency of the next line. Historically, Intel has achieved a 30% increase in frequency from launch to process maturity when thermal limitations were not present. Figure 11b compares the SPEC CPU integer component performance for Woodcrest at 1.86 GHz and 2.0 GHz to Xeon 5080 3.73 GHz. There is a very large gain in mcf, contributing to the high overall scores for Woodcrest. A few components of the 1.86 GHz Woodcrest are below the 3.73 GHz Dempsey. All components of the 2.0 GHz Woodcrest are better than or essentially equal to the 3.73 GHz Dempsey.
Figure 11b: Woodcrest component performance compared to Dempsey.
In the SPEC CPU 2000 Integer benchmark, the performance progression goal of Moore's Law has been successfully achieved up to the current generation Woodcrest. Going forward, it may be difficult to continue this on a single-threaded benchmark. Look for developments to enable employing multi-cores on current single threaded applications. Technically, there were only two architecture generation transitions between Pentium III and Core 2 via Pentium M. The performance gains however are comparable to three generations.
SQL Server 2000 and 2005 TPC-C Performance
As useful as the SPEC CPU integer benchmarks are, database benchmarks are still required for a valid multi-threaded and multi-processor performance assessment. The best known are TPC-C and TPC-H (http://www.tpc.org/) for transaction processing and decision support, respectively. The TPC benchmarks are applications running on top of SQL Server or other database engines. The configurations are often at maximum memory and disk I/O so deficiencies in the system architecture can be exposed. For multi-processor systems, this is a demonstration of the ability to scale up. Both TPC-C and TPC-H are useful, but there are more results available on the TPC-C transaction processing benchmark for meaningful comparisons.
Some characteristics of the TPC-C benchmark are as follows. The rules for a valid result are that there can be a maximum of 12.5 tpm-C per warehouse. Each warehouse constitutes approximately 84 MB of data. A 125K tpm-C result requires a minimum of 10K warehouses or 840 GB of data. A smaller data size will have lower disk I/O. The objective is to employ the smallest number of warehouses to meet the reporting requirement for transaction performance. Comparing TPC-C results is a tricky matter because the benchmark is I/O intensive. The memory to data size ratio can significantly influence disk I/O activity and in turn performance. The comparison is clearest when the tpm-C performance to memory ratio and the tpm-C per disk are reasonably close. The most common objective in the TPC-C benchmark is to achieve the highest score for each system. The price/performance must also be reported with each result. Some results are oriented to this aspect. The total system cost can be impacted by the choice of components, especially storage, that have no bearing on performance. So rather than comparing price/performance, the tpm-C per disk is noted below.
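The warehouse sizing rule described above can be sketched as follows. The 12.5 tpm-C per warehouse limit and the approximate 84 MB per warehouse come from the text. The function names are hypothetical, and decimal gigabytes (1 GB = 1,000 MB) are assumed so the result matches the 840 GB figure.

```python
import math

MAX_TPMC_PER_WAREHOUSE = 12.5   # maximum tpm-C allowed per warehouse
MB_PER_WAREHOUSE = 84           # approximate data size per warehouse

def min_warehouses(tpmc: float) -> int:
    """Smallest number of warehouses that permits reporting the given tpm-C."""
    return math.ceil(tpmc / MAX_TPMC_PER_WAREHOUSE)

def data_size_gb(warehouses: int) -> float:
    """Approximate database size in decimal GB for the given warehouse count."""
    return warehouses * MB_PER_WAREHOUSE / 1000

warehouses = min_warehouses(125_000)
print(f"{warehouses:,} warehouses, about {data_size_gb(warehouses):,.0f} GB of data")
```

For a 125,000 tpm-C result this gives 10,000 warehouses and roughly 840 GB, matching the figures above.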
Table 3 shows selected TPC-C results for two socket systems. Table 4 shows selected TPC-C results for four socket systems. The results span both SQL Server 2000 and SQL Server 2005. SQL Server 2000 was already very efficient at transaction processing. It is unclear whether SQL Server 2005 made any gains for two or four socket systems. It is possible the SQL Server 2005 64-bit results benefit from more efficient use of memory, not having to use AWE to access more than 4 GB memory. SQL Server 2005 does have NUMA aware features that contribute to transaction processing performance most noticeable in systems with 16 or more processors.
Processor | Freq. GHz | Cache | Mem. GB | No. of Disks | tpm-C | tpm-C /GB | tpm-C /Disk | Report Date |
Pentium III | 1.0 | 256 KB | 4 | 120 | 17,336 | 4,334 | 144 | 09/26/01 |
Pentium III | 1.26 | 512 KB | 4 | 104 | 22,007 | 5,502 | 212 | 10/25/01 |
Xeon | 2.2 | 512 KB | 8 | 143 | 33,768 | 4,221 | 236 | 02/25/02 |
Xeon | 3.06 | 512 KB | 12 | 214 | 44,942 | 3,745 | 210 | 05/29/03 |
Xeon | 3.06 | 1 MB L3 | 12 | 214 | 52,468 | 4,372 | 245 | 07/15/03 |
Xeon | 3.2 | 1 MB L3 | 12 | 256 | 54,096 | 4,508 | 211 | 10/13/03 |
Xeon | 3.2 | 2 MB L3 | 12 | 290 | 60,364 | 5,030 | 208 | 03/02/04 |
Xeon | 3.6 | 1 MB L2 | 16 | 104 | 68,010 | 4,251 | 654 | 11/01/04 |
Xeon | 3.6 | 2 MB L3 | 16 | 203 | 74,298 | 4,644 | 366 | 02/11/05 |
Opteron | 2.6 | 1 MB | 16 | 174 | 71,413 | 4,463 | 410 | 02/14/05 |
Opteron | 2.8 | 1 MB | 32 | 296 | 76,214 | 2,382 | 257 | 09/30/05 |
Opteron DC | 2.4 | 2x1 MB | 32 | 296 | 109,633 | 3,426 | 370 | 11/08/05 |
Opteron DC | 2.6 | 2x1 MB | 32 | 380 | 113,628 | 3,551 | 299 | 03/07/06 |
Xeon | 3.73 | 2x2 MB | 32 | 518 | 125,954 | 3,936 | 243 | 05/01/06 |
5160 DC | 3.0 | 1x4 MB | 32 | 520 | 140,264 | 4,383 | 270 | 06/26/06 |
5160 DC | 3.0 | 1x4 MB | 64 | 590 | 169,360 | 2,646 | 285 | 05/22/06 |
Table 3: Selected TPC-C results for two socket systems.
Processor | Freq. GHz | Cache | Mem. GB | No. of Disks | tpm-C | tpm-C /GB | tpm-C /Disk | Report Date |
PIII Xeon | 0.9 | 2 MB | 8 | 182 | 39,158 | 4,895 | 215 | 09/27/01 |
Xeon MP | 1.6 | 1 MB | 8 | 238 | 48,911 | 6,114 | 206 | 05/17/02 |
Xeon MP | 1.6 | 1 MB | 32 | 235 | 61,564 | 1,924 | 262 | 08/23/02 |
Xeon MP | 2.0 | 2 MB | 32 | 292 | 78,116 | 2,441 | 268 | 04/21/03 |
Xeon MP | 2.8 | 2 MB | 32 | 296 | 84,595 | 2,644 | 286 | 06/30/03 |
Xeon MP | 3.0 | 4 MB | 32 | 240 | 95,163 | 2,974 | 397 | 03/01/04 |
Xeon MP | 3.0 | 4 MB | 32 | 266 | 102,667 | 3,208 | 386 | 03/01/04 |
Xeon MP | 3.66 | 1 MB | 64 | 434 | 141,504 | 2,211 | 326 | 04/21/05 |
Xeon 7041 | 3.0 | 2x2 MB | 64 | 458 | 188,761 | 2,949 | 412 | 10/28/05 |
Opteron | 2.4 | 1 MB | 32 | 303 | 115,110 | 3,597 | 380 | 10/15/04 |
Opteron | 2.6 | 1 MB | 64 | 403 | 130,623 | 2,041 | 324 | 02/14/05 |
Opteron | 2.8 | 1 MB | 64 | 407 | 138,845 | 2,169 | 341 | 09/30/05 |
Opteron DC | 2.2 | 2x1 MB | 64 | 408 | 187,296 | 2,927 | 459 | 04/21/05 |
Opteron DC | 2.4 | 2x1 MB | 128 | 406 | 206,181 | 1,611 | 508 | 11/04/05 |
Opteron DC | 2.6 | 2x1 MB | 128 | 406 | 213,986 | 1,672 | 527 | 03/20/06 |
Tulsa | 3.73 | 16 MB | ? | ? | 320K? | ? | ? | ? |
Table 4: Selected TPC-C results for four socket systems.
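The tpm-C/GB and tpm-C/Disk columns in Tables 3 and 4 are simple ratios of the reported figures. The following sketch recomputes them for three rows copied from the tables. Rounding to the nearest whole number is an assumption about how the columns were produced.

```python
# (label, tpm-C, memory in GB, number of disks), copied from Tables 3 and 4.
RESULTS = [
    ("Pentium III 1.0 GHz, two socket", 17_336, 4, 120),
    ("Xeon MP 3.66 GHz, four socket", 141_504, 64, 434),
    ("Opteron DC 2.6 GHz, four socket", 213_986, 128, 406),
]

def derived_ratios(tpmc: int, mem_gb: int, disks: int) -> tuple[int, int]:
    """Return (tpm-C per GB of memory, tpm-C per disk), rounded to whole numbers."""
    return round(tpmc / mem_gb), round(tpmc / disks)

for label, tpmc, mem_gb, disks in RESULTS:
    per_gb, per_disk = derived_ratios(tpmc, mem_gb, disks)
    print(f"{label}: {per_gb:,} tpm-C/GB, {per_disk:,} tpm-C/disk")
```

Results are most directly comparable when both ratios are close, since a much higher memory to data ratio or a much larger disk count changes the I/O profile of the run.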
Figure 12 below shows TPC-C performance for dual socket Intel systems from the Pentium III 1.0 GHz to the new Dual-Core Intel 5160 (Woodcrest) at 3.0 GHz. The Pentium III 1.0 GHz/256 KB posted a result of 17,335 tpm-C. The Pentium III 1.26 GHz/512 KB reached 22,007. No results were posted for two-way Xeon systems with the 180nm 256 KB cache Willamette core. The next step was a 130nm Xeon 2.2 GHz/512 KB result of 33,768. From here, Northwood based systems (512 KB L2, no L3) progressed to 44K at 3.06 GHz. Then the Gallatin core with 1 MB or 2 MB L3 cache in addition to the 512 KB L2 cache was made available in two-way Xeon systems, having been previously available only in four-way Xeon MP systems, and reached 60K. The Prescott derived core with 2 MB cache at 3.6 GHz reached 74K. The 90nm Xeon actually reached 3.8 GHz, but there are no published TPC-C results. Beyond this point, the 90nm Prescott/Irwindale core was not advanced further because of thermal limitations.
Figure 12: TPC-C performance for dual processor Intel systems.
The AMD Opteron took the two processor (socket) lead starting with the single core Opteron 2.8 GHz at 76K. This particular Opteron result was achieved with 32 GB memory compared with 16 GB in the two Xeon 3.6 GHz results. Under normal circumstances, reducing the Opteron 2.8 GHz configuration memory to 16 GB should reduce the performance on the order of 10% for the given performance to memory (tpm-C/GB) ratio. Due to the peculiar memory options for Opteron systems, a 16 GB configuration would allow the use of PC3200 DDR versus PC2100 for the 32 GB configuration. So the best possible two-socket Opteron 2.8 GHz single core system probably would have achieved approximately the same result. The dual-core Opteron 2.6 GHz reached 113,628 also with 32 GB.
The dual core Prescott based 90nm processor, Smithfield, was limited by power dissipation to 2.8 GHz compared with the top single core frequency of 3.8 GHz. This was not sufficient to generate an impressive (read: publishable) performance result. Opteron, with a top single core frequency of 2.8 GHz, was able to fit the thermal envelope with a 2.6 GHz dual core version. On 90nm, the Opteron has comparable performance to the Xeon at single core, with frequencies of 2.8 GHz and 3.6 GHz respectively. In the dual core versions, the Opteron 2.6 GHz has a significant advantage over Xeon at 2.8 GHz. The 65nm dual core Dempsey (derivative of Prescott) was able to reach 3.73 GHz, with the result of 125,954, now with 32 GB memory. In theory, a 65nm dual core Opteron at 3.9 GHz should reach 40% higher than the 90nm 2.6 GHz 113K result. The 3.0 GHz Woodcrest on 65nm is at 169,360 with 64 GB memory. A similar dual socket Woodcrest result of 140K with 32 GB memory shows that doubling memory contributes a 20% performance gain, starting from a performance to memory ratio of approximately 4K tpm-C per GB.
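The memory effect on the two Woodcrest results can be checked directly from the Table 3 figures. This is a minimal sketch. The helper name is hypothetical, and the two systems also differ in disk count, so the gain is not purely a memory effect.

```python
def percent_gain(new: float, old: float) -> float:
    """Percentage increase of new over old."""
    return (new / old - 1.0) * 100.0

WOODCREST_32GB_TPMC = 140_264   # Xeon 5160, 32 GB, from Table 3
WOODCREST_64GB_TPMC = 169_360   # Xeon 5160, 64 GB, from Table 3

memory_gain = percent_gain(WOODCREST_64GB_TPMC, WOODCREST_32GB_TPMC)
starting_ratio = WOODCREST_32GB_TPMC / 32   # tpm-C per GB before doubling memory

print(f"gain from 32 GB to 64 GB: {memory_gain:.1f}%")
print(f"starting ratio: {starting_ratio:,.0f} tpm-C per GB")
```

The gain comes out at a little over 20%, from a starting ratio a little above 4K tpm-C per GB.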
Figure 13 shows the TPC-C performance for quad processor Intel systems from 2000 to 2006. The four-way Pentium III Xeon 900 MHz with 2 MB L2 cache result of 39K is a strong indication that the two-way Pentium III 1 GHz/256 KB L2 performance was probably constrained by its small cache. Otherwise, the four-way performance should not be more than double the two-way result. The Xeon MP with the Foster core at 1.6 GHz/1 MB reached 61K. The 130nm Gallatin core, introduced at 2.0 GHz with 2 MB L3 cache, achieved 78K. A later release at 2.8 GHz/2 MB reached 84.6K and the final version at 3.0 GHz with 4 MB L3 cache reached 102K (March 2004). The AMD Opteron gained the four processor lead starting with the single core 2.2 GHz at 105K (May 2004), continuing to the 2.6 GHz at 130K (February 2005).
Figure 13: TPC-C performance for quad processor Intel systems.
The Intel Xeon MP lineup on the 90nm process offered both the desktop core with 1 MB L2 and an MP server-only version with 8 MB L3 cache. The Cranford codename applied to the Prescott core in the Xeon MP form-factor, while the Potomac codename applied to the version with 8 MB L3. The top Cranford frequency was 3.66 GHz, and the top Potomac was 3.33 GHz. The Cranford processor regained the single core 4-socket system lead at 141K (April 2005). Interestingly, no four-way result was published for the Potomac core with 8 MB L3 cache. There was speculation that the latency to the L3 cache was excessive and negated its benefits. Potomac did produce respectable eight- and 16-way single core performance of 251K and 376K tpm-C, respectively.
The performance gain from Gallatin 3.0 GHz to Cranford 3.66 GHz does warrant some observations. There is a 20% frequency increase and a 40% performance gain. The memory configuration increased from 32 GB to 64 GB. Also significant is the chipset change from the ServerWorks GC-HE, with a single 400 MHz FSB supporting four processors, to the Intel E8500 with two 667 MHz FSBs, for two processors per bus. So some performance gain must be attributed to the improved memory system. Whether this is from memory bandwidth or memory transaction rate is unclear. The SPEC CPU 2000 integer result for Gallatin 3.0 GHz/4 MB L3 is comparable to that for Cranford 3.66 GHz/1 MB L2 (1379 and 1317), so it is possible the last of the Gallatin line on the ServerWorks chipset was memory bandwidth limited.
HP published TPC-C results for the four socket dual core Opteron systems: 2.2 GHz at 188K tpm-C (April 2005), 2.4 GHz at 206K (November 2005), and 2.6 GHz at 214K (March 2006). Paxville, a dual core version of Irwindale with 2 MB L2 cache at 3.0 GHz, with official names Xeon 7040 and 7041, yielded 188K (October 2005), but this was less than the best contemporary four socket dual core Opteron result, and about 12% less than the Opteron 2.6 GHz DC. The Xeon 7041 result was with 64 GB memory. At the tpm-C/GB ratio of 2,949, it is possible that increasing memory to 128 GB might have made up the difference. The Intel 8501 chipset can support 128 GB DDR memory, but all vendors have elected the DDR2 option with 64 GB maximum memory configuration.
Instead of employing the dual core 65nm Cedar Mill in the Xeon MP line, Intel plans to introduce Tulsa, which features two Prescott derived cores, each with 1 MB L2 cache, and a shared 16 MB L3 cache. Intel presentations claim a 1.7x increase over Paxville, which would be about 320K tpm-C. Presumably this is based on actual measurements with pre-production cores, and not an estimate. If we suppose that 15% of this gain relative to the Xeon 7041 result of 188K is from the frequency difference, and another 15% from the larger memory configuration, this still leaves nearly a 30% gain that is attributed to improved scaling with the 16 MB shared cache. If all of this turns out to be the case, then whatever deficiency existed with the Potomac L3 cache has been resolved in Tulsa.
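The decomposition above treats the individual gains as multiplicative factors. A minimal sketch of that arithmetic follows, using only the rounded figures quoted in the text; the 15% frequency and memory factors are the suppositions stated above, not measurements, and the function name is purely illustrative.

```python
# Figures quoted in the text; the two 15% factors are suppositions, not measurements.
paxville_tpmc = 188_000     # Xeon 7041 four socket result, rounded
claimed_speedup = 1.7       # Intel's claimed Tulsa gain over Paxville
frequency_factor = 1.15     # supposed gain from the frequency difference
memory_factor = 1.15        # supposed gain from the larger memory configuration


def residual_factor(total, *known_factors):
    """Divide a total speedup by the known multiplicative factors."""
    for factor in known_factors:
        total /= factor
    return total


projected_tpmc = paxville_tpmc * claimed_speedup
cache_factor = residual_factor(claimed_speedup, frequency_factor, memory_factor)

print(f"Projected Tulsa result: {projected_tpmc:,.0f} tpm-C")
print(f"Residual attributed to the shared L3 cache: {cache_factor:.3f}x")
```

The residual comes out a little under 1.3x, which is the "nearly 30%" attributed to the shared cache.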
From the Pentium III 1.0 GHz/256 KB to the Xeon 3.6 GHz/2 MB, there was approximately a 4x increase in performance in dual processor systems and a 3.5x increase in the quad processor systems. The TPC-C benchmark depends on a combination of factors, including processor, memory, and disk I/O. The fact that the results still fall in line with SPEC CPU 2000 Integer indicates that the overall system architecture is being properly scaled with processor performance. With the new multi-core processors launching this year and in 2007, finally unconstrained by thermal limitations, look for performance gains in multi-threaded applications to exceed the pace of Moore's Law. The traditional doubling of the logic complexity of a processor was only expected to generate a 40% performance gain. An unconstrained dual-core can yield an 80% performance gain over the corresponding single core.
There are not enough published TPC-H results for a thorough comparative analysis. Care should be taken in comparing results. SQL Server 2000 had serious deficiencies in handling the very large queries of the TPC-H benchmark, especially on high-end server systems. SQL Server 2005 made significant improvements and is highly competitive with other big name DBMS products in data warehouse type applications. Some of this is illustrated in Table 5, showing the TPC-H 1000 GB scores for 16-way Itanium 2 systems with various SQL Server versions. The first is a Unisys result with the 1.5 GHz. The next two are Bull results with the slightly faster 1.6 GHz Itanium. Nearly all of the performance gain is due to improvements from SQL Server 2000 to SQL Server 2005. SQL Server 2000 had a highly questionable ability to benefit from parallel execution plans, especially beyond four processors. In fact, a parallel execution plan was as likely to cause performance degradation.
SQL Server Version | Freq. GHz | Cache | Mem. GB | No. of Disks | Power | Throughput | QphH | Report Date |
2000 | 1.5 | 6 MB | 64 | 214 | 7,331 | 3,687 | 5,199 | 10/15/03 |
2005 | 1.6 | 6 MB | 64 | 238 | 19,348 | 9,799 | 13,769 | 7/5/05 |
2005 + SP1 | 1.6 | 6 MB | 64 | 238 | 23,279 | 12,502 | 17,060 | 11/7/05 |
Table 5: TPC-H 1000 GB results for 16-way Itanium 2 systems.
All of these problems were completely fixed in SQL Server 2005. If that were not enough, yet additional improvements in the TPC-H benchmark scores were made in SQL Server 2005 SP1. The gains in SP1 may have been due to the ability to correlate date-time columns between tables to generate a better execution plan for the TPC-H queries, as opposed to broadly applicable improvements in large query handling and parallel execution plans. In any case, there is a very strong argument for migrating data warehouse applications to SQL Server 2005 without delay.
Table 6 shows some recent TPC-H 100 GB results on SQL Server 2005 x64 Enterprise Edition. The first two systems are the HP ProLiant ML570G4 and ProLiant DL585G1; the third is the Dell PowerEdge 2900. The Opteron result is on the RTM version of SQL Server 2005, build 1399. The other two results are on SP1, build 2047. There is also a Dell result on four Xeon 7041 processors that appears to be on the RTM version (15,209 power, 8,740 throughput, and 11,529 QphH). Given that a pre-release of SP1 was available in late 2005, it is unclear why SP1 was not used. It is not certain that SP1 produces performance gains at this level.
Processors | Freq. GHz | Cache | Mem. GB | No. of Disks | Power | Throughput | QphH | Report Date |
4 Xeon 7041 | 3.0 | 2x2 MB | 64 | 124 | 19,497 | 10,405 | 14,243 | 5/22/2006 |
4 Opteron 880 | 2.4 | 2x1 MB | 64 | 84 | 15,610 | 10,171 | 12,600 | 11/4/2005 |
2 Xeon 5160 | 2.2 | 512 KB | 48 | 128 | 13,052 | 8,409 | 10,477 | 7/10/2006 |
Table 6: Recent TPC-H 100 GB results on SQL Server 2005 x64 Edition.
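As a consistency check on the table, the TPC-H composite metric (QphH) is defined as the geometric mean of the Power and Throughput scores. The sketch below recomputes it from the Table 6 values; because the table shows rounded numbers, agreement is expected only to within about one unit. The dictionary layout and function name are illustrative choices.

```python
import math

# (Power, Throughput, reported QphH) copied from Table 6.
table6 = {
    "4 Xeon 7041": (19_497, 10_405, 14_243),
    "4 Opteron 880": (15_610, 10_171, 12_600),
    "2 Xeon 5160": (13_052, 8_409, 10_477),
}


def composite_qphh(power, throughput):
    """Geometric mean of the Power and Throughput scores."""
    return math.sqrt(power * throughput)


for system, (power, throughput, reported) in table6.items():
    computed = composite_qphh(power, throughput)
    print(f"{system}: computed {computed:,.1f}, reported {reported:,}")
```

Because the composite is a geometric mean, a system with a high Power score but a weak Throughput score (or the reverse) is pulled toward the lower of the two, which is why the Xeon 7041 and Opteron 880 composites are closer than their Power scores alone would suggest.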
Figure 15 below shows the stream 0 (Power) run times for each of the TPC-H queries (lower time is better performance). The HP Xeon 7041 system has better results than the Opteron 880 in all but two queries.
Figure 14: TPC-H 100 GB Stream 0 (Power) run times by query.
Figure 16 shows the TPC-H 100 GB average time for five simultaneous streams (used to compute the throughput score) for the above system. Here the results are mixed between the Xeon 7041 and the Opteron 880, reflecting the very close overall throughput scores.
Figure 15: TPC-H 100 GB average multi-stream run times by query.
A TPC-H 100 GB scale factor means that the line item table data size is approximately 100 GB. With other tables and all indexes, the total database size is approximately 170 GB. Notice that many of the 100 GB results are on systems with 64 GB of memory, the same as for the 1000 GB results on the 16 processor Itanium systems. It is unfortunate that there is not a broad set of results for the x86 systems at 300 GB and 1000 GB.
Xeon and Opteron Notes
At this point, it is possible to discuss the Intel Xeon and AMD Opteron performance characteristics with respect to SQL Server. It can certainly be said that AMD made good design decisions in the K7 to K8 generations, including significant innovations in the x86 market. The original K7 was a more advanced architecture than its contemporary, the Pentium III at 250nm, with a slightly deeper pipeline for higher frequency and much improved floating point units. The 250nm Katmai was listed at 9.5M transistors, with 2x16 KB L1 caches probably consuming 2M transistors. The K7 started at 22M transistors, with 2x64 KB L1 caches probably consuming 8M transistors. The difference in logic transistor count should be about 2x.
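The "about 2x" estimate follows from subtracting the estimated L1 cache transistors from each total. A minimal sketch of that subtraction, treating the quoted counts as millions of transistors and taking the cache estimates exactly as given above (they are the text's approximations, not vendor figures):

```python
# Transistor counts in millions, as quoted in the text.
# The L1 cache figures are the text's rough estimates.
katmai = {"total": 9.5, "l1_cache": 2.0}
k7 = {"total": 22.0, "l1_cache": 8.0}


def logic_transistors(chip):
    """Total transistors minus the estimated L1 cache transistors."""
    return chip["total"] - chip["l1_cache"]


ratio = logic_transistors(k7) / logic_transistors(katmai)
print(f"Katmai logic: {logic_transistors(katmai):.1f}M")
print(f"K7 logic:     {logic_transistors(k7):.1f}M")
print(f"Ratio:        {ratio:.2f}x")
```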
From this point, Intel followed the path of continuing to emphasize processor core complexity (advancing the micro-architecture) to meet Moore's Law, with Willamette probably having on the order of 25M logic transistors (44M total with 256 KB cache). AMD elected to integrate the memory controller, with emphasis on reducing memory latency. One Intel quick reaction design (Timna) essentially attached a north bridge to the Pentium III die, retaining a FSB in silicon with some (but not the best possible) reduction in memory latency.
The deep pipeline nature of the modern micro-processor and the large difference between core clock (<1ns)>
Two other significant innovations in the Opteron architecture are the following. The first is simultaneous bi-directional (SBD) point-to-point links between processors and I/O controllers. In the AMD line, this is called HyperTransport. SBD was used in the Intel 870 chipset for Itanium 2 processors to link nodes of a NUMA system. The general idea is to achieve maximum bandwidth per pin and the lowest latency in off-chip communication. In this area, SBD is better than the shared bus architecture, which the Xeon line carried over from the original Pentium Pro architecture of the mid-1990s. The second is the instruction set architecture enhancement. Simply extending the x86 instruction set architecture from 32- to 64-bit was definitely called for, but by itself does not require great innovation. The very significant, perhaps great, ISA innovation was in figuring out how to expand the 64-bit mode to support 16 general-purpose registers, up from 8 in the 32-bit and previous modes. The principal remaining advantage of RISC architecture over x86 was the higher number of registers. Of course, most RISC architectures have already been nearly obliterated from the computer systems market through strong investment, good execution, and other reasons by the two major x86 companies.
One aspect of the Opteron processor and system architecture is that memory bandwidth scales with the number of processors because the memory controller is integrated into the processor. Adding processors adds memory bandwidth. One of the marketing arguments made for AMD over Intel is that memory bandwidth scales on Opteron and is a bottleneck on the Xeon system. The first part of the statement is true, and there are definitely applications that can use the additional memory bandwidth in Opteron systems compared with earlier generation Xeon systems. But unless the particular application requires the extra bandwidth, this would not be a bottleneck.
There is no definitive evidence that memory bandwidth on the Xeon platforms is a constraint to SQL Server performance. Many have reiterated the memory bandwidth argument in discussing Opteron versus Xeon on the subject of SQL Server performance without introducing clear evidence, even adding that this leads to better performance and scaling. Such a statement indicates a lack of understanding of the difference between performance and scaling. Scaling refers to the performance trend with increasing processors, but not absolute performance. So if architecture A has a one processor performance of 1.0, a two processor performance of 1.7, and a four processor performance of 2.55 (=1.7x1.5), while architecture B has a one processor performance of 1.0 (possibly on a different scale), a two processor performance of 1.8, and a four processor performance of 3.06 (=1.8x1.7), then B has better scaling than A, while making no statement on the baseline performance.
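The distinction can be made concrete by computing the gain for each doubling of processor count, which is independent of the baseline. This is a small sketch using the hypothetical A and B figures from the paragraph above; the function name and dictionary layout are illustrative only.

```python
def scaling_steps(perf_by_procs):
    """Gain for each step up in processor count, e.g. 1->2 and 2->4."""
    counts = sorted(perf_by_procs)
    return {
        f"{low}->{high}": perf_by_procs[high] / perf_by_procs[low]
        for low, high in zip(counts, counts[1:])
    }


# Relative performance by processor count, from the example in the text.
arch_a = {1: 1.0, 2: 1.7, 4: 2.55}
arch_b = {1: 1.0, 2: 1.8, 4: 3.06}

steps_a = scaling_steps(arch_a)
steps_b = scaling_steps(arch_b)

for step in steps_a:
    print(f"{step}: A gains {steps_a[step]:.2f}x, B gains {steps_b[step]:.2f}x")
```

Multiplying every value in arch_a by a constant (a faster or slower single processor baseline) leaves its step ratios unchanged, which is exactly why scaling says nothing about absolute performance.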
Consider the following. The two processor 130nm Xeon 3.06 GHz, 1 MB L3 cache, 533 MHz FSB (4.3 GB/sec) achieved a TPC-C result of 52K tpm-C. The four processor Xeon MP 3.0 GHz, 4 MB L3 cache, 400 MHz FSB (3.2 GB/sec) achieved 102K tpm-C. The next generation of four socket Xeon systems increased the combined FSB bandwidth to 10.6 GB/sec or better. So if there ever was a memory bandwidth bottleneck, this probably ended with the old ServerWorks GC-HE chipset.
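The bandwidth figures quoted here are consistent with the nominal FSB transfer rate multiplied by the bus width. The sketch below assumes a 64-bit (8 byte) data bus and uses the nominal rates from the text; the two 667 MHz buses correspond to the E8500 configuration described earlier. These are peak theoretical numbers, not measured throughput.

```python
FSB_WIDTH_BYTES = 8  # assumption: 64-bit front-side bus data path


def fsb_bandwidth_gb_s(transfer_rate_mhz, buses=1):
    """Peak bandwidth in GB/sec for one or more front-side buses."""
    return transfer_rate_mhz * 1e6 * FSB_WIDTH_BYTES * buses / 1e9


configs = [
    ("Two processor Xeon, one 533 MHz FSB", 533, 1),
    ("Four processor Xeon MP, one 400 MHz FSB", 400, 1),
    ("Four socket E8500, two 667 MHz FSBs", 667, 2),
]

for label, rate, buses in configs:
    print(f"{label}: {fsb_bandwidth_gb_s(rate, buses):.1f} GB/sec")
```

Note that the older four processor platform shared less bus bandwidth among four processors than the two processor platform gave to two, yet delivered nearly twice the tpm-C.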
Given that single core Xeon 3.6 GHz performance is comparable to single core Opteron 2.8 GHz for both two and four socket systems, there is every reason to believe that the Opteron advantage over Xeon at dual cores is because the Opteron is only slightly constrained thermally while the Xeon is significantly constrained. So applying the memory bandwidth argument to SQL Server performance is nothing but mindless regurgitation of a marketing argument unsupported by facts and with nothing to indicate that it has any relevance to SQL Server.
It is possible that the Opteron memory architecture may have had an advantage in memory transaction rate, meaning the ability to fetch small blocks. This has to do with the number of pipes to memory, not necessarily the bandwidth to memory. None of this detracts from the fact that the Opteron processor and system architecture was highly successful, particularly the 90nm dual core generation that achieved a respectable performance lead over contemporary Xeons.
On TPC-H, there is not a sufficient range of results for comparison purposes.
System Replacement and Purchasing Strategy Implications
In part, the above discussion was to build guidelines for the replacement of existing systems and a strategy for purchasing new systems. Within reason, Intel has built the IA-32 line to Moore's Law, with performance doubling every two years. Presumably AMD intends to be competitive. The Itanium line fits certain needs. However, the slower succession cycle has rendered Itanium systems less competitive when a significant process generation gap develops. An example is Montecito at 90nm competing with Woodcrest at 65nm. The theoretical 50% advantage of the Itanium architecture is significantly reduced by the process generation gap.
In addition to Moore's Law, the overall system cost structure is being maintained or improved. That is, a two-socket server system in one generation is likely to be succeeded by the next generation system at approximately the same or lower cost. Another factor is that in any given generation, a four socket system will cost more than two two socket systems. The same holds for an eight socket system relative to two four socket systems. Further, a four socket system should have slightly less than twice the performance of the two socket system when configurations are properly adjusted.
All of this translates to not buying much more performance headroom than needed in the near term, and relying on a frequent replacement cycle. For a database application with a 40% annual load growth rate, purchasing 2x headroom is a reasonable strategy. In two years time, the headroom is consumed, and then the system is replaced by a new system with twice the performance. This is far more effective than purchasing 4x headroom for a four year replacement cycle. For higher usage growth rates, a one year replacement cycle might be advisable.
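The headroom argument is compound growth arithmetic. The sketch below shows how quickly a given headroom is consumed at a fixed annual growth rate, using the 40% rate from the text; the function names are illustrative and the model assumes steady compounding with no other changes.

```python
import math


def load_after(years, annual_growth=0.40):
    """Load relative to today after compounding growth for `years`."""
    return (1 + annual_growth) ** years


def years_until_exhausted(headroom, annual_growth=0.40):
    """Years until load reaches `headroom` times today's load."""
    return math.log(headroom) / math.log(1 + annual_growth)


for years in (1, 2, 4, 5):
    print(f"After {years} year(s): load is {load_after(years):.2f}x")

for headroom in (2, 4):
    print(f"{headroom}x headroom lasts {years_until_exhausted(headroom):.1f} years")
```

At 40% growth the load is just under 2x after two years and just under 4x after four years, so the choice is between buying 2x now and 2x again in two years at next generation prices, or paying today's premium for 4x.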
It is a common accounting practice to depreciate computer hardware over five years. However, using this practice to mandate a five year replacement cycle for a server tasked to a specific application in a high growth rate environment can be impractical. If it is necessary to keep a particular system for five years, then consider rotating the servers running the most critical applications to less critical functions as supported by technical analysis.
For data centers where floor space and electrical power are serious considerations, a technology driven replacement cycle is even more important. Replacing systems on a two year cycle frees up floor space, effectively doubling performance per volume. Consider that the other option involves significant capital outlay to expand the size of the data center. The new Woodcrest systems and SFF SAS disk drives can also significantly reduce power consumption per system over the recent previous generations, in addition to increasing performance per watt.