
Monday, September 29, 2008

Server System Architecture, 2007 (Preliminary)

The original article in this series was published in 2002 and covered Pentium III and Pentium III Xeon systems. I had prepared an update in 2005 to cover the NetBurst (Pentium 4) based Xeon and Xeon MP systems. For some reason I never got around to publishing this article, but I did briefly mention some 2005-2006 system architectures in the "System and Storage Configuration for SQL Server" article of 2006. It is now time for an update to the Core 2 and Opteron based systems of 2007 and a look ahead to 2008. This is a preliminary article because information on upcoming systems is scarce. I will make some guesses now, and then update the article when release information becomes available. Then we can all see whether I guessed correctly or not.

The Intel architecture for Symmetric Multi-Processor computer systems dates to the early 1990s. The concept employed a common bus to connect multiple processors and the north bridge (with memory controller). At the time, this offered an excellent balance between performance (multi-processor scaling) and implementation cost (simplicity). One could argue that the old shared-bus architecture was obsolete by the early 2000s, and that the time for the high-speed point-to-point simultaneous bi-directional signaling technology in the AMD Opteron architecture had arrived. During that period, Intel was too heavily distracted by a sequence of crises on other matters (Rambus, Timna, Itanium, no viable server chipset for the first two NetBurst generations, X64, and the 90 nm process leakage current on top of the third generation NetBurst power issues, to name a few) to contemplate this change. So the bus architecture was extended in Intel systems with various twists, delaying the transition to a point-to-point architecture until the 2008-2009 timeframe with the Nehalem processor core and the Common System Interconnect (CSI).

This article will follow the Intel convention for discussing multi-core processors. What was once a processor is now a socket (even though some Intel processors in the past fit in a slot, not a socket). A processor will be called dual core (DC), quad core (QC) and so on at the socket level regardless of how many CPU cores reside on one die. A dual core processor can be two single core die in one (socket) package or a single dual core die. A quad core socket can be two dual core die or a single quad core die. There does not appear to be any indication of a performance difference between two single core die in one socket package and a dual core die. So the argument that two die in one package is not a true dual/quad core is just silliness with no relevance. Anyone making such an argument should support it with performance analysis. The fact that no performance argument has been made speaks for itself.



The Intel E8500/8501 Chipset

The E8500 chipset, featuring the Dual Independent (front-side) Bus (DIB) architecture at 667 MHz, was introduced in early 2005. Prior to this, Intel had failed to produce a viable four socket chipset for the NetBurst processor line (Foster on 180 nm and Gallatin on 130 nm). Even the E7500/7501 for two socket Xeon systems did not garner design wins from the major OEMs. The two initial processors supported by the E8500 were 90 nm NetBurst parts, one with 1 MB L2 cache (Cranford) running up to 3.66 GHz, and the second with 8 MB L3 cache (Potomac) at 3.33 GHz. The next processor supported was the dual core Xeon 7000 line, a 90 nm NetBurst part with 2 MB L2 cache per core composed of two single core die in one socket (Paxville), in 2Q 2006. This was followed by the Xeon MP 7100 line, featuring a 65 nm dual core die (Tulsa) with one L3 cache of up to 16 MB shared by the two cores, in 4Q 2006. The Xeon 7000 and 7100 lines supported FSB operation at either 667 MHz or 800 MHz. The faster FSB was added in the E8501 update.

The E8500/8501 showed good performance characteristics with the single core processors (141,504 tpm-C with 4x3.6 GHz/1 MB L2 cache). The dual core Xeon 7000 line with 2 MB L2 cache per core did not show good performance characteristics relative to the single core (188,761 tpm-C with 4x3.0 GHz DC/2x2 MB L2 cache). The dual core Xeon 7100 with 16 MB shared L3 cache recaptured the four socket X86 TPC-C performance lead from Opteron with a very impressive result of 318,407 tpm-C, compared with 213,986 for the 4x2.6 GHz DC DDR1 Opteron and 262,989 for the 4x2.8 GHz DC DDR2 Opteron. Both Opteron systems were configured with 128 GB memory compared with 64 GB for the Xeon MP systems. The Opteron platform retained the four socket lead in TPC-H results. The big cache and Hyper-Threading (HT) make significant contributions to SMP scaling and performance in high call volume database applications (318K tpm-C generates approximately 12,000 calls/sec), but not in very large DSS queries.
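As a sanity check on the calls/sec figure, the sketch below reproduces the approximate number, assuming the standard TPC-C mix in which the reported New-Order transactions are roughly 45% of all transactions (the exact fraction is my assumption, not part of the published result).

```python
# Rough check of the ~12,000 calls/sec figure quoted above.
# Assumption: tpm-C counts only New-Order transactions, which are roughly
# 45% of the TPC-C mix (Payment, Order-Status, Delivery, Stock-Level make
# up the rest).

tpmC = 318_407                 # Xeon 7100 four socket result
new_order_fraction = 0.45      # assumed share of New-Order in the mix

new_order_per_sec = tpmC / 60
total_calls_per_sec = new_order_per_sec / new_order_fraction

print(f"New-Order/sec: {new_order_per_sec:,.0f}")
print(f"Approximate total calls/sec: {total_calls_per_sec:,.0f}")
# -> roughly 11,800 calls/sec, consistent with the ~12,000 cited above.
```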

Figure 1 shows the system architecture of a four socket system built around the Intel E8501 chipset with the DIB architecture and Xeon 7100 processors.


Figure 1: Intel E8501 chipset with Xeon 7100 processor (2006).

The DIB concept is not new. This architecture was employed in the ProFusion chipset for the 8-way Pentium III Xeon platform, designed by a company later acquired by Intel. It was simply not possible to push the operation of a single bus shared by four processor sockets and one memory controller beyond the 400 MHz of the previous generation Xeon MP platform (with the ServerWorks GC-HE chipset). The E8500/8501 Memory Controller Hub (MCH) has four Intermediate Memory Interfaces (IMI) supporting a proprietary protocol (possibly a precursor to FB-DIMM) instead of the native DDR2 protocol. Each IMI connects to an External Memory Bridge (XMB), which splits into two DDR2 memory channels (DDR is also supported). The XMB is described as a full memory controller, not just a memory repeater. There are a total of eight DDR2 memory channels. The maximum memory of the E8501 with DDR2 is 64 GB, so it is possible that only two DDR2 DIMMs can be configured on each channel.

The E8501 data sheet lists the IMI at 2.67 GB/sec outbound (write) and 5.33 GB/sec inbound (read), corresponding to the bandwidth of two DDR-333 MHz, even though DDR2-400 MHz is supported with the 800 MHz FSB. It is unclear whether this is an oversight or the actual value. Nominally, the maximum memory bandwidth is then 21 GB/sec even though 12.8 GB/sec is the limit of the combined DIB. In any case, it is the memory transaction rate that is important, not the nominal bandwidth.
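A quick nominal-bandwidth calculation using the data sheet IMI figures and the 800 MHz, 8-byte-wide FSB shows why the combined DIB, rather than the memory channels, is the nominal limit. This is just the arithmetic implied by the numbers above, not additional data.

```python
# Nominal bandwidth arithmetic for the E8501 memory subsystem.
imi_links = 4
imi_read_gb = 5.33       # GB/s inbound (read) per IMI link, per data sheet
imi_write_gb = 2.67      # GB/s outbound (write) per IMI link

fsb_count = 2            # dual independent buses
fsb_mt_per_sec = 800     # 800 MHz FSB (MT/s)
fsb_width_bytes = 8      # 64-bit data bus

memory_read_bw = imi_links * imi_read_gb                 # ~21.3 GB/s
memory_write_bw = imi_links * imi_write_gb               # ~10.7 GB/s
fsb_bw = fsb_count * fsb_mt_per_sec * fsb_width_bytes / 1000  # 12.8 GB/s

print(f"Aggregate IMI read bandwidth : {memory_read_bw:.1f} GB/s")
print(f"Aggregate IMI write bandwidth: {memory_write_bw:.1f} GB/s")
print(f"Combined DIB bandwidth       : {fsb_bw:.1f} GB/s")
# The FSB side, not the memory side, is the nominal bottleneck.
```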



The Intel 5000P Chipset and Derivatives

The current Intel two socket chipset is the 5000P introduced in 2Q 2006. The initial processor supported was the dual-core Xeon 5000 series, with two NetBurst 65 nm die in one 771-pin socket. Support for the Xeon 5100 series processor, with the new 65 nm Core 2 architecture and a single dual core die, was added only one month later. Later in 2006, support was extended to the Xeon 5300 series quad core processors, with two 65 nm Core 2 dual core die in a single package.


Figure 2: Intel 5000P chipset and Xeon 5100 processors (2Q 2006).

The 5000P has the dual independent bus architecture from the E8500/8501 chipset, and it is clear that the 5000P is derived from the E8500/8501. Both have DIB, four memory channels, and 24 PCI-E lanes. Both are manufactured on the 130 nm process and come in a 1432-pin package. One difference is that the long obsolete HI 1.5 interface to the south bridge for legacy devices has finally been replaced by the new ESI, introduced in desktop chipsets in 2004 as the DMI. Since there is only one processor and the MCH on each bus, it is possible to drive FSB operation up to 1333 MHz in the current generation.

A point-to-point architecture also has two electrical loads, but there are significant differences. One advantage of the Intel 5000P arrangement with two loads per bus (one processor and one MCH) is that the old bus architecture could be retained. Another is that two die in one package constitute a single electrical load, while two sockets each with a single die constitute two electrical loads (a capacitance matter). The bus architecture supports multiple devices on one bus. This allowed Intel to simply place two single core die in one socket for a "dual-core" product without actually having to manufacture a new die with two cores, and again for a "quad-core" product with two dual core die in one socket.

The advantage of a point-to-point protocol with recent (late 1990s) technology is a much higher signaling rate and simultaneous bi-directional (SBD) transmission. Since the early 2000s, point-to-point signaling technology could support operation in the range of 2.5-3.0 GT/sec. The next generation will support 5 GT/sec. Compare this to the bus architecture, which reached 1.33 GT/sec in 2006 and is targeting 1.6 GT/sec in late 2007.

The 5000P has three x8 PCI-Express ports and one Enterprise South Bridge Interface (ESI). Each x8 port can be configured as two x4 ports. The arrangement shown above has six x4 ports. It is also possible to configure one x8 port plus four x4 ports, or two x8 ports and two x4 ports. Most vendors offer a mix of one or two x8 ports and four or two x4 ports. Only HP offers systems based on the 5000P (ML370G5) and the E8501 (ML570G4) with six PCI-E x4 ports. This configuration provides the most PCI-E ports to drive disk IO. It is unclear whether any of the Intel 8033x IOP based SAS adapters can drive enough bandwidth to saturate more than a PCI-E x4 port.
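For reference, a rough per-port bandwidth calculation, assuming roughly 250 MB/sec per PCI-E Generation 1 lane per direction after encoding overhead (the usual rule of thumb, not a chipset-specific figure), shows what an x4 versus an x8 port can carry.

```python
# Approximate PCI-E Generation 1 bandwidth per port.
lane_gen1_mb = 250   # ~250 MB/s per lane per direction after 8b/10b encoding

for lanes in (4, 8):
    print(f"x{lanes} Gen1 port: {lanes * lane_gen1_mb / 1000:.1f} GB/s per direction")

# Six x4 ports on the 5000P: aggregate per-direction IO bandwidth.
print(f"Six x4 ports: {6 * 4 * lane_gen1_mb / 1000:.1f} GB/s per direction")
# A SAS adapter that cannot exceed ~1 GB/s is well matched to an x4 slot,
# which is why the six-x4 layout is attractive for disk IO.
```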

The ESI is described as having extensions to the standard PCI Express specification. A curious point about the 631x ESB and 632x ESB is that they connect to the MCH via not just the ESI but also an additional full x8 PCI-E port. On the downstream side of the ESB are two x4 PCI-E ports along with a plethora of standard IO ports. Note that the sum of the downstream ports exceeds the bandwidth of the upstream ports to the MCH. The general observation is that computer system IO traffic is mostly bursty and highly non-uniform, so it is highly unlikely that all devices are consuming bandwidth simultaneously. The upstream connection of x8 PCI-E lanes in addition to the ESI port ensures the ability to handle a combination of events on the legacy IO side, while remaining available to carry traffic for the downstream PCI-E ports.


Other Intel 5000 Chipsets

There are several variations on the Intel 5000 chipset. The 5000Z has DIB, two memory channels, 16 PCI-E lanes, and the ESI; basically, two memory channels and one of the x8 PCI-E ports are disabled relative to the 5000P. The 5000V has DIB, two memory channels, one x8 PCI-E port, and the ESI connection to the ESB. The 5000X, on the other hand, has all the features of the 5000P plus a 16 MB snoop filter, which will be discussed in the next section. It is not clear whether there is really only one 5000 MCH die from which the P, X, V, and Z variations are derived by selectively disabling components.



The Snoop Filter Cache

The 5000X is targeted at workstation applications. The 5000X data sheet describes the Snoop Filter cache in the Chipset Overview section as:

One of the architectural enhancements in Intel 5000X chipset is the inclusion of a Snoop Filter to eliminate snoop traffic to the graphics port. Reduction of this traffic results in significant performance increases in graphics intensive applications.

Later in the 5000X data sheet, in the Functional Description chapter:

The Snoop Filter (SF) offers significant performance enhancements on several workstation benchmarks by eliminating traffic on the snooped front-side bus of the processor being snooped. By removing snoops from the snooped bus, the full bandwidth is available for other transactions. Supporting concurrent snoops effectively reduces performance degradation attributable to multiple snoop stalls. See Figure 5-1, "Snoop Filter" on page 302.

The SF is composed of two affinity groups, each containing 8K sets of 16-way associative entries. The overall SF size is 16 MB. Each affinity group supports a pseudo-LRU replacement algorithm. Lookups are done across the full 32 ways per set for hit/miss checks.

As shown in Figure 5-1, the snoop filter is organized in two halves referred to as Affinity Group 1 and Affinity Group 0, or the odd and even snoop filters respectively. As shown in Figure 5-1, Affinity Group 1 is associated with processor 1 and Affinity Group 0 is associated with processor 0. Under normal conditions a snoop is completed with a 1 snoop stall penalty. When the processors request simultaneous snoops, the first snoop is completed with a one snoop stall penalty and the second snoop requires a 2 snoop stall penalty.

For the purposes of simultaneous SF access arbitration, processor 0 is given priority over processor 1. Thus simultaneous snoops are resolved with a 1 snoop stall penalty for processor 0 and a 2 snoop stall penalty for processor 1.

The SF stores the tags and coherency state information for all cache lines in the system. The SF is used to determine if a cache line associated with an address is cached in the system and where. The coherency protocol engine (CE) accesses the SF to look-up an entry, update/add an entry, or invalidate an entry in the snoop filter.

The SF has the following features:

Snoop Filter tracks a total of 16 MB of cache lines (2^18 L2 lines).

8K sets organized as one interleave via a 2 x 16 Affinity Set-Associativity array.

There are a total of 8K x 2 x 16 = 256K lines (2^18).

2 x 16 Affinity Set-Associativity will allocate/evict entries within the 16-way corresponding to the assigned affinity group if the SF look up is a miss. Each SF look up will be based on 32-way (2x16 ways) look up.

The array size of the snoop filter RAM is equivalent to 1 MB plus 0.03 MB of Pseudo-Least-Recently-Used (pLRU) RAM.

There are several additional Snoop Filter features that are not listed here. Refer to the 5000X data sheet for these items.

I was somewhat confused at first in seeing some presentations describing the snoop filter as improving performance in workstation applications but not server applications. It is now clear that the more correct interpretation is that the Snoop Filter implementation in the 5000 chipset actually did significantly improve several workstation benchmarks but did not show improvement consistently for server benchmarks. From the description of next generation of chipsets, I am inclined to think that Intel believes the Snoop Filter should improve server performance and is working to that goal.

Also confusing is the Snoop Filter size. The 5000X has a 16 MB snoop filter cache. This does not mean that there is a 16 MB cache in the 5000X that the snoop filter uses. Rather, the Snoop Filter can track a total of 16 MB of cache on the processors. The 5000X chipset supports two sockets. For the Xeon 5100 series, there is one 4 MB L2 cache in each socket, and for the Xeon 5300 series, there are two 4 MB L2 caches in each socket. So the maximum combined processor cache in the 5000X platform is 16 MB. Each cache line on all NetBurst and Core 2 processor lines is 64 bytes. So 16 MB of cache contains 256K cache lines, which the Snoop Filter requires a little over 1 MB of RAM to track.
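The arithmetic can be checked with a short sketch. The roughly 4 bytes of tag and state per entry is my assumption, used only to show that the tag RAM is on the order of 1 MB, consistent with the data sheet figure quoted above.

```python
# Snoop filter sizing arithmetic for the 5000X.
cache_line_bytes = 64
tracked_cache_mb = 16                 # total processor cache the SF can cover

lines_tracked = tracked_cache_mb * 2**20 // cache_line_bytes
print(f"Cache lines tracked: {lines_tracked:,} (2^{lines_tracked.bit_length() - 1})")

# Organization: 8K sets x 2 affinity groups x 16 ways = 256K entries.
sets, groups, ways = 8 * 1024, 2, 16
print(f"SF entries: {sets * groups * ways:,}")

# The SF stores only tags and state; assume ~4 bytes per entry to show the
# on-chip RAM is on the order of 1 MB rather than 16 MB.
approx_tag_bytes = 4
print(f"Approx. tag RAM: {lines_tracked * approx_tag_bytes / 2**20:.1f} MB")
```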

The following information is from various slide presentations at Intel Developer Forum Spring 2007. The Snoop Filter is a cache tag structure stored in the chipset. It keeps track of the status of cache lines in the processor caches. It contains only the tags and status of cache lines, not the data. It filters all unnecessary snoops to the remote bus. The Snoop Filter decreases FSB utilization by forwarding to the remote bus only requests that actually need to be snooped: cache lines that could potentially be present in a dirty state on the remote bus, and cache lines that need to be invalidated. It filters all other processor snoops and a large fraction of IO snoops. IO bound applications benefit automatically. (There is also a snoop filter in the E8870 SPS, the crossbar of the Itanium 2 chipset.)



The Next Generation Intel Chipsets

Several next generation Intel chipsets have been described in various public forums. One is for a four socket system supporting the Core 2 architecture processors. The NetBurst and Core 2 architectures share a common FSB protocol, allowing a chipset to support both processor lines, as the 5000 does, so the E8501 chipset can operate Core 2 processors. Now the two socket Core 2 platform showed good performance scaling from a 3.0 GHz dual core Xeon 5160 on each 1333 MHz FSB to the quad core 2.66 GHz X5355. It is very likely that two dual core Xeon 5100 processors, each with a shared 4 MB L2 cache, on a single 800 MHz FSB (the maximum for three loads) would not scale nearly as well, and certainly not two quad core Xeon 5300 processors. It is also uncertain that a Xeon 5160 could challenge the Xeon 7140 with its 16 MB shared L3 cache in the four socket E8501 platform. So while the Core 2 and E8501 combination is possible, there is no business reason to do so. Hence, the next generation four socket Xeon platform needs to feature quad core processors to be performance competitive.

One solution would be to increase the Core 2 on-die cache to the 8-16 MB range and still operate two processor sockets sharing one 800 MHz FSB. It would, however, seem strange explaining to a customer that the high-end Xeon 7300 has two sockets sharing one 800 MHz FSB while the dual socket Xeon 5300 line operates a single socket per bus at 1333 and later 1600 MHz, even though this scenario has been the case for several years now. One last point: the bus architecture originating with the Pentium Pro specified four processors per bus, so two quad cores on one bus might not work at all.

It appears that Intel is now confident in designing chipsets with the multiple independent processor bus architecture (there was a lapse of several years between the ProFusion in 1999 and the E8500 in 2005 when Intel did not have a contemporary DIB chipset). The next generation four socket solution is shown in Figure 3 below.


Figure 3: Intel Clarksboro chipset with Tigerton processors (Q3 2007).

The processor code named Tigerton will be the Xeon 7300 line. The chipset codename is Clarksboro and the platform codename is Caneland. Note the quad-independent bus (QID?) architecture. The Tigerton processor is the same 65 nm Core 2 with 4 MB shared L2 cache and two die in one package arrangement as in the Xeon 5300 line. The primary option is a quad core processor. There is a dual core option with a single die for applications that require bandwidth but not quite so much processor power. Changing a silicon die even to increase the cache size is not a light undertaking at Intel. This solution allowed the use of an existing processor silicon die while supporting a reasonable 1066 MHz FSB. It is unclear why Intel did not elect the 1333 MHz FSB that is available in the 5000P chipset. Also unclear is the 64 MB Snoop Filter. Even with four quad core Penryn, the total cache is 48 MB. Is there an undisclosed 45 nm Penryn with 8 MB L2 cache?

There are also four memory channels with 8 FB-DIMMs per channel. Perhaps there is an XMB-type memory controller, or perhaps FB-DIMM allows eight devices on a single channel. Maximum memory configuration is 256 GB with 8 GB DIMMs in 32 sockets. The Clarksboro chipset is listed as operating with 533 and 667 MHz FB-DIMM. Unless each memory channel actually has more bandwidth than a single FB-DIMM channel, the Clarksboro chipset has the same memory bandwidth as the 5000P.
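A quick sketch of the capacity and nominal bandwidth arithmetic, assuming each FB-DIMM 667 channel carries roughly the read bandwidth of one DDR2-667 channel (an assumption on my part, not a disclosed figure):

```python
# Clarksboro memory capacity and nominal bandwidth (assumed channel rates).
channels = 4
dimms_per_channel = 8
dimm_gb = 8

print(f"Max memory: {channels * dimms_per_channel * dimm_gb} GB")   # 256 GB

# Assumption: one FB-DIMM 667 channel ~ one DDR2-667 channel for reads,
# i.e. 667 MT/s x 8 bytes ~ 5.3 GB/s.
fbd_read_gb = 0.667 * 8
print(f"Nominal read bandwidth: {channels * fbd_read_gb:.1f} GB/s "
      f"({fbd_read_gb:.1f} GB/s per channel)")
# The same four-channel total as the 5000P, consistent with the point above.
```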

In any case, this is clearly the end of the line for FSB based multi-processor systems. The bandwidth-per-pin advantages of a point-to-point protocol are required to support multi-processor (socket) systems in future generations.

There are two distinct dual socket chipsets described for the upcoming Penryn processor (the Core 2 architecture on 45 nm, plus some additional enhancements and a larger 6 MB shared L2 cache; Penryn enhancements relevant to server applications include faster OS primitives for spinlocks, interrupt masking, and time stamps). The first, shown below, is the chipset codenamed Seaburg with platform codenamed Stoakley.


Figure 4: Intel Stoakley platform/Seaburg chipset (est. 2H 2007).

The FSB frequency is now up to 1600 MHz. The Seaburg MCH has a snoop filter for 24 MB of cache, so this will support four die with 6 MB cache each. Memory remains 533 and 667 MHz FB-DIMM, but 800 MHz later would not be unexpected. Maximum memory will become 128 GB with 16x8 GB DIMMs. The interesting new feature is PCI-Express Generation 2. The Seaburg PCI-E IO can be configured for 44 Generation 1 lanes, or 2 x16 Generation 2 lanes plus the connections for the ESB. PCI-E Generation 1 is the original simultaneous bi-directional 2.5 GT/sec; Generation 2 is 5.0 GT/sec. It is not clear whether the 2 x16 PCI-E Gen 2 lanes can be configured as 8 x4 Gen 2 lanes, or whether Gen 2 is (initially) exclusive to graphics.
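For a rough comparison of the nominal numbers, assuming the usual ~250 MB/sec per Gen 1 lane and ~500 MB/sec per Gen 2 lane per direction:

```python
# Seaburg nominal bandwidth comparison: FSB vs the two PCI-E lane options.
fsb_mt_per_sec, fsb_bytes = 1600, 8
print(f"Per-FSB bandwidth: {fsb_mt_per_sec * fsb_bytes / 1000:.1f} GB/s")  # 12.8 GB/s each

gen1_lane_gb = 0.25   # ~250 MB/s per lane per direction (Gen 1)
gen2_lane_gb = 0.5    # Gen 2 doubles the signaling rate

print(f"44 Gen1 lanes: {44 * gen1_lane_gb:.1f} GB/s per direction")
print(f"2 x16 Gen2   : {2 * 16 * gen2_lane_gb:.1f} GB/s per direction")
```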

The significant increase in both IO bandwidth and PCI-E slots is highly appreciated. Hopefully there will be powerful SAS controllers that can be matched with the PCI-E Gen 2 lanes. The preferred configuration depends on the available SAS controller. If only the old IOP 8033x is available, then ten x4 PCI-E Gen 1 slots are probably best. If the new Intel IOP 8134x can drive a full x8 PCI-E port, then five x8 PCI-E Gen 1 slots are a good choice. If a new IOP controller is available with PCI-E Gen 2, then x4 Gen 2 slots are the choice.

The second two socket chipset is codenamed San Clemente, with platform codename Cranberry Lake. There are two DDR2 memory channels, although a later version supporting DDR3 should be expected. There is a configurable set of PCI-E ports and the ESI to connect the desktop ICH9 south bridge. This combination is targeted at lower cost, lower power, high-density two socket systems, similar to the 5000Z and 5000V.


Figure 5: Cranberry Lake platform with San Clemente chipset (est. 3Q 2007).

For unspecified reasons, it was determined that the low cost platform should use the same DDR2/3 memory solution as desktop platforms instead of FB-DIMM. Hopefully there will be a DDR3-1600 option for this platform. There were no details on the PCI-E configuration, but given the workstation interest in this platform, two x16 PCI-E Gen 2 ports and other combinations are probable.



AMD Opteron Platform

The Opteron server platforms were introduced in 2003, with most major hardware vendors adopting them by the 2004 timeframe. The Opteron processor featured three 16-bit Hyper-Transport (HT) links, each of which can be configured as two 8-bit links, and two DDR1 memory channels. The platform included an IO bridge from HT to PCI-X and other legacy IO devices. Dual core Opteron processors became available in mid-2005. In early 2007, the dual core Opteron processors were updated to DDR2 memory.

The Opteron processor variants were determined by the number of HT links enabled. A single socket platform would only need one HT link for IO. In a two socket system, each processor requires one HT link to connect to the other processor, and the second HT link to connect to an IO bridge. In a four socket system, three HT links on each processor are required, two to connect to other processors and one for IO, as shown in Figure 6 below.


Figure 6: Opteron four socket platform with DDR2 and PCI-E (1Q 2007).

Note that two of the four Opteron processors do not connect their third HT link to an IO bridge. No vendor, to my knowledge, mixes two and three HT port Opteron processors in a system. The system shown above features 20 PCI-E lanes on each IO bridge; one is configured as 1 x8 plus 3 x4 PCI-E slots and the other as 2 x8 plus 1 x4. The downstream IO bandwidth on each HT port adds up to 5 GB/sec for the PCI-E lanes, plus the bandwidth for the PCI-X and other common IO devices. Again, this is in line with the assertion that IO bandwidth is highly non-uniform temporally, and it is definitely desirable to over-configure the downstream bandwidth with a higher number of available slots. It should be up to the knowledgeable system administrator to correctly place IO adapters when concurrent IO traffic is required.
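A rough over-subscription check for one IO bridge in Figure 6, using the nominal per-direction figures already cited in this article:

```python
# Downstream PCI-E lanes vs the upstream 16-bit HT link for one IO bridge.
pcie_lanes = 20
gen1_lane_gb = 0.25                        # ~250 MB/s per Gen 1 lane
downstream_gb = pcie_lanes * gen1_lane_gb  # 5 GB/s for the PCI-E lanes alone

ht_width_bytes = 2                         # 16-bit HT link
ht_gt_per_sec = 2.0                        # 2 GT/s
upstream_gb = ht_width_bytes * ht_gt_per_sec   # 4 GB/s per direction

print(f"Downstream PCI-E: {downstream_gb:.1f} GB/s per direction")
print(f"Upstream HT link: {upstream_gb:.1f} GB/s per direction")
# Downstream capacity exceeds the upstream link, which is acceptable as long
# as concurrent traffic is spread sensibly across the two IO bridges.
```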

Opteron Core

There are two major elements of the Opteron processor that distinguish its performance and scaling characteristics. One is the integrated memory controller. The second is the point-to-point protocol in the HT links. In the old bus architecture, a memory request is sent over the front-side bus to a separate memory controller chip before it is issued to the actual memory silicon, and the reverse path is followed for the return trip. For the last ten years, the processor core has operated at a much higher frequency than the bus. Note also that for the Xeon 5100 line, the FSB operates at 1333 MHz for data transfers, but the address rate is 667 MHz, compared with the top Core 2 processor frequency of 3 GHz. On a memory request, there is a delay to synchronize with the next FSB address cycle.

While transistors on recent generation process technology can switch at very high speeds, every time a signal has to be sent off-chip, there are significant delays because of the steps necessary to boost the current drive of the signal to the magnitude required for off-chip communication compared to intra-chip communication. So the Opteron integrated memory controller reduces latency on both the bus synchronization and the extra off-chip communication time.

In the multi-processor Opteron configuration shown in Figure 6, there are three possible memory access times. In the AMD NUMA notation, memory directly attached to the processor issuing the request is called a 0-hop access. Memory on an adjacent processor is a 1-hop access. Memory on the far processor is a 2-hop access. It would seem that the non-local 1-hop memory access travels a similar distance as the Intel arrangement of processor over FSB to memory controller. However, the lower synchronization delay on HT compared with the FSB favors Opteron.

Concerning memory performance and multi-processor scaling, the number of memory channels scales with the number of processors (sockets) in the Opteron platform. This is currently two memory channels per processor, meaning four memory channels in a two socket system and eight in a four socket system. The current Intel systems have four memory channels in both the two and four socket systems (depending on how the XMB counts). The previous generation Intel two socket system had two memory channels. So this would favor Opteron at four sockets. Most marketing material from AMD emphasizes the bandwidth of Opteron systems. For many server applications, it is the memory transaction performance that is more important, not the bandwidth. While in the Opteron case bandwidth and memory channels do scale together, it is still more correct to put emphasis on the memory channels, not the bandwidth number.

The Opteron processor features first level 64 KB instruction and 64 KB data caches, and a unified 1 MB L2 cache. The relatively low latency to memory allows Opteron to function well with a smaller on-die cache than comparable generation Intel processors. The dual core Opteron has an independent 1 MB L2 cache for each core, compared with the Intel Core 2, which has a single 4 MB L2 cache shared by the two cores. It is unclear whether one of the two arrangements has a meaningful advantage over the other.

Hyper-Transport

The current generation Opteron processors feature HT operating at 2 GT/sec (that is, two giga-transfers per second). Since the full 16-bit HT link is 2 bytes wide, the transfer rate is 4 GB/sec in each direction. As this is a point-to-point protocol, it is possible to send traffic in both directions simultaneously, making the full bandwidth 8 GB/sec per 16-bit HT link and 24 GB/sec over the three HT links on each processor. I prefer to cite unidirectional bandwidth in discussing disk IO because the bulk of the transfer is frequently in one direction at any given time.

There is occasionally some confusion about the HT operating frequency. At the current 2 GT/sec, my understanding is that the clock is 1 GHz, with a transfer occurring on both clock edges, making for 2 GT/sec. When the Opteron platform was introduced, the operating speed may have been 1.6 GT/sec. The next generation HT, to be introduced in late 2007 or early 2008, can operate at 5.2 GT/sec with backward compatibility for the lower frequencies.
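The bandwidth arithmetic for the current and next generation HT rates is straightforward; the sketch below simply restates the 4/8/24 GB/sec figures above and projects them to the 5.2 GT/sec rate.

```python
# HT link bandwidth at the cited transfer rates (16-bit links, transfers on
# both clock edges, so GT/s = 2 x clock frequency).
link_bytes = 2   # 16-bit link width

for gt_per_sec in (2.0, 5.2):   # current HT and the next generation HT
    one_way = link_bytes * gt_per_sec
    print(f"{gt_per_sec} GT/s: {one_way:.1f} GB/s per direction, "
          f"{2 * one_way:.1f} GB/s bidirectional, "
          f"{3 * 2 * one_way:.1f} GB/s over three links")
```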

Next Generation Opteron

The next generation Opteron processor, codenamed Barcelona, features a quad core die, a dedicated 512 KB L2 cache for each core, a shared 2 MB L3 cache, micro-architectural enhancements over the K8, memory controller improvements, and four HT 3.0 links. The Barcelona cache arrangement is interesting. The L2 cache dedicated to each core is reduced to 512 KB from 1 MB in the previous generation, and there is now a 2 MB shared L3 cache. All L2 and L3 caches are exclusive: any memory address can be in only one of the four dedicated L2 caches or the one common L3 cache. So there is effectively 4 MB of L2/L3 cache on the die, spread across five different pools. Intel did not like exclusive caches, preferring an inclusive arrangement where any memory in an L2 cache must also be in the L3 cache. This requires that the L3 be much larger than the L2 to be effective. The implication is that certain SKUs, for example one with a 512 KB L2 and 1 MB L3, did not really benefit from the L3.
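A trivial sketch of the effective capacity under the exclusive policy, since the private L2 caches and the shared L3 simply add:

```python
# Effective on-die cache capacity for Barcelona's exclusive L2/L3 hierarchy:
# a line lives in exactly one of the four private L2s or the shared L3, so
# the capacities sum directly.
cores = 4
l2_per_core_kb = 512
l3_shared_mb = 2

total_mb = cores * l2_per_core_kb / 1024 + l3_shared_mb
print(f"Effective exclusive L2+L3 capacity: {total_mb:.0f} MB")   # 4 MB
```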

The Barcelona information available states socket compatibility with current generation Opteron platforms. This means that memory remains two DDR2 channels per socket at 533 and 667 MHz, and the three HT links operate at 2 GT/sec. Does the current 1207-pin Socket F actually support all four HT links? Or are the current Opteron platforms only capable of operating three HT links, and a new socket (platform) is required to support four HT links? The documents on the Hyper-Transport Web site describe four and eight socket systems where each processor is directly connected to all other processors. In the four socket system, the connections use a 16-bit HT link, leaving one 16-bit HT link available for IO on each socket.

In the eight socket system, processors are connected with 8-bit HT links, leaving one 8-bit HT link available for IO per socket.

The Barcelona HT links can operate at 5.2 GT/sec. So a new platform with the 5.2 GT/sec links should have better performance characteristics, especially in multi-processor scaling. It is probable that a later platform modification would have DDR3 memory channels. FB-DIMM is discussed as a possibility, but given the continued use of DDR2/3 in desktop systems, it is unlikely AMD would need to transition the Opteron platform to FB-DIMM.

The 4x16/8x8 HT configuration along with the effort to make HT 3.0 an IO protocol will allow interesting possibilities in SAN architecture. The primary emphasis on HT connected IO devices will probably be very high-end graphics and special co-processors. Both HT 3.0 and PCI-Express Generation 2 will operate at 5+ GT/s. PCI-E is an IO oriented protocol, while HT has protocols for processor-to-processor communication, and there was a much greater emphasis on low latency operation in HT compared with PCI-E. This is why the primary emphasis should be high-end graphics and co-processors. But there could be opportunities for SAN interconnects as well.

Now a SAN is really just a computer system that serves LUNs, similar to a file server that makes a file system available to other network computers. The primary connection technology is Fibre Channel (FC). A SAN can also operate over any network protocol, Gigabit Ethernet plus TCP/IP for example; iSCSI was added to make this more efficient. FC, however, is not really up to the task of supporting high bandwidth, large block sequential data operations (a table scan or even a database backup). It would be possible to directly connect a server to a SAN over HT, and an HT adapter is not even needed: the server and SAN simply connect with an HT cable. On the SAN side, there would probably be an HT to PCI-E bridge followed by PCI-E to SAS or FC adapters. So all of this could be built without a special adapter, beyond the HT based server and SAN systems themselves.

Eight Socket Systems

Previous generation eight socket systems were NUMA systems, that is, Non-Uniform Memory Access (usually cache-coherent, or ccNUMA). Strictly speaking, even a two socket Opteron is NUMA. The difference is that the older eight socket systems had a very large difference in memory access times from remote nodes compared to the local node. Local node memory access might be 150 ns, while remote node access could be over 300 ns. Opteron platforms might have memory access times of 60 ns for local, 90 ns for 1-hop, and 120 ns for 2-hop access, which does not show adverse NUMA platform effects.

Most IT shops and ISVs were completely unaware of the special precautions required to architect a database application that scales on NUMA platforms. Several major ISV applications even show severe negative scaling on NUMA systems, due to key design decisions based on experience with non-NUMA systems, or even on purely theoretical principles disconnected from and at odds with real platform characteristics. The consequence was that most database applications behaved poorly on NUMA systems relative to a four socket SMP system, whether the DBA knew it or not. So naturally I am very curious to find out whether important line-of-business database applications will scale well on the eight socket Barcelona HT 3.0 platform.

Opteron platforms may have a setting for NUMA or SUMA memory organization. I will try to discuss this in a later revision.



Server System Architecture Summary

The 2006-2007 platforms from both Intel and AMD offer major advances in processor, memory, and IO capability over the previous generation platforms. Shops that operate a large number of servers should give serious consideration to replacing older platforms. Significant reductions in the number of servers, the floor space required, and power consumption can be achieved. The floor space savings are especially important if space is rented from a hosting company.

Should a DBA upgrade to solve performance problems? Any serious problems in the design of a database application and the way it interacts with SQL Server should always be addressed first. After that, balance the cost and benefit of continued software performance tuning with a hardware upgrade. If you have an older (single core) NUMA system, consider stepping down to a two socket quad core or a four socket dual core, or step down when the four socket quad core is available. There is a strong likelihood that the newer smaller platform will have better performance than the older system and will not have adverse NUMA behaviors.

Is either the AMD or the Intel platform better than the other? As of May 2007, at the two socket level, the Intel quad core Xeon 5300 line on the 5000P chipset offers the best performance with good memory and IO capability. At the four socket level, performance is mixed between the Opteron 8200 and the Xeon 7100, with large query DW favoring AMD and high call volume OLTP favoring Intel. However, the Opteron will probably run many applications as well or better without special tuning skills, and it has better memory and IO configurations.

The next generation platform competition will begin soon. In the absence of hard information, I am going to speculate that at four sockets, Barcelona will have the advantage over the 65 nm Tigerton processor and Clarksboro chipset. This is based on the two socket quad core Xeon X5355 having comparable performance to the four socket dual core 2.8 GHz Opteron. Projecting forward, the Xeon 7300/Clarksboro combination gets the snoop filter plus 256 GB memory, up from 64 GB, while the Barcelona transition benefits from micro-architecture enhancements. The 45 nm Penryn processor with the Seaburg chipset should have the advantage at two sockets. This is based on the architecture enhancements in Penryn and the frequency headroom of the 45 nm process. A four socket Barcelona advantage would put pressure on Intel to make the 45 nm processor available with the Clarksboro chipset sooner rather than later. Will AMD elevate the competition to eight socket platforms?
