Introduction to “Magny-Cours”
The AMD Opteron™ 6100 Series Processor, code-named “Magny-Cours”, to be released on March 29, 2010, will be the first 12-core x86 processor on the market. The AMD Opteron 6000 Series platform, also known as the Socket G34 “Maranello” platform, will house the AMD Opteron 6100 Series processor and targets the enterprise market. Key features of the platform are Direct Connect Architecture 2.0 (system scalability), AMD-V™ 2.0 (hardware virtualization, including AMD-Vi for I/O virtualization, also known as IOMMU), and AMD-P 2.0 (power consumption). This article primarily discusses the “Magny-Cours” processor, the technologies that make up Direct Connect Architecture 2.0, and how these come together to enable the effective use of so many cores in a multiprocessor system. It also looks at a few of the software tools AMD has been developing to help users exploit these cutting-edge technologies.
Figure 1 gives a high-level example of the AMD Opteron 6000 Series platform, in this case in a 4P configuration. With 8-core processors installed, this system could support 32 cores, and with 12-core processors installed, it could support up to 48 cores, an unequalled number of true physical cores on 4P systems.
Figure 1.
What’s Direct Connect Architecture 2.0? From a 20,000-foot view, the most obvious characteristic is the increased core count: up to 12 cores, from a maximum of 6 cores in the previous generation of AMD server processors. The shared L3 cache correspondingly increases to 12 MB from a previous maximum of 6 MB. Each core has its own L1 and L2 caches. Next, each processor has 4 HyperTransport™ 3.0 Technology (HT3) links, each capable of operating at a rate of 6.4 GT/s, increased from a maximum of 2 GT/s with HT1. This gives a maximum HT bandwidth of 25.6 GB/s per processor. Another piece of this technology is HT Assist, which is explained later in this article. Finally, “Magny-Cours” comes equipped with the well-known integrated memory controller technology that AMD pioneered in x86 on the first Opteron in 2003. “Magny-Cours” doubles the number of memory channels seen on previous AMD products, for a total of 4 DDR3 channels supporting memory speeds up to DDR3-1333. All of these technologies make up the latest Direct Connect Architecture 2.0.
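The 25.6 GB/s figure can be reproduced with simple arithmetic. The sketch below assumes a full-width link is 16 bits wide in each direction and that the headline number counts both directions of one such link; neither assumption is stated in the text above, and the function name is purely illustrative.

```python
# Back-of-the-envelope HyperTransport link bandwidth.
# Assumptions (not from the article): a full-width link carries 16 bits per
# transfer in each direction, and the headline figure counts both directions.

def ht_link_bandwidth_gb_s(transfers_gt_s, width_bits=16):
    """Peak bandwidth of one link in ONE direction, in GB/s."""
    return transfers_gt_s * width_bits / 8  # bits per transfer -> bytes


ht3_one_way = ht_link_bandwidth_gb_s(6.4)   # HT3 at 6.4 GT/s
ht3_both_ways = 2 * ht3_one_way
ht1_one_way = ht_link_bandwidth_gb_s(2.0)   # HT1 at 2 GT/s, for comparison

print(f"HT3 link, one direction:   {ht3_one_way:.1f} GB/s")
print(f"HT3 link, both directions: {ht3_both_ways:.1f} GB/s")
print(f"HT1 link, one direction:   {ht1_one_way:.1f} GB/s")
```

Under these assumptions a single 16-bit link at 6.4 GT/s moves 12.8 GB/s each way, which matches the quoted 25.6 GB/s when both directions are summed.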
Under the Hood of Magny-Cours
Let’s talk about how AMD can deliver so many physical cores and memory channels in each processor. Socket G34 processors such as “Magny-Cours” are actually packages that contain 2 dies, with each die containing 4 or 6 cores. Each die has a 6 MB L3 cache that is shared by the cores on the die, for a total of 12 MB of L3 cache on the processor. Each die has 2 memory channels and 4 HT 3.0 links. Because each die has its own memory controller driving its 2 memory channels, each die is a NUMA (Non-Uniform Memory Access) node. (See AMD’s New Designs on Software: NUMA for more information on NUMA.) Figure 2 below gives a logical view of this. The picture on the left shows the processor package with its two pieces of silicon. The picture on the right shows the processor as two NUMA nodes. The thicker lines (labeled x16) show HT 3.0 links with all 16 bits in use. The thinner lines (labeled x8) indicate a link split into 8-bit connections. cHT stands for coherent HyperTransport; these links are used for CPU-to-memory traffic. nHT links are non-coherent and are used for moving data between memory and I/O devices.

Figure 2.
An important consequence is that an AMD Opteron 6100 Series processor is now two NUMA nodes, since there are two memory controllers in each package. In previous generations, each Opteron processor was only one NUMA node. Figure 3 below shows an example topology of a 2P configuration. First, note that each of the elements P0 through P3 represents a die. P0 and P1 are the dies (nodes) in the first processor, and P2 and P3 are the dies in the second processor. Second, note that every NUMA node is connected to every other NUMA node. This would not be possible without the addition of the fourth HT 3.0 link.
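One way to see the two-nodes-per-package layout from software is to ask the operating system which CPUs belong to which NUMA node. The following is a minimal sketch for Linux only, assuming the kernel exposes the usual /sys/devices/system/node/nodeN/cpulist files; the helper names parse_cpulist and numa_nodes are illustrative and not part of any AMD tool.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a Linux cpulist string such as '0-5,12-17' into CPU numbers."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty input, e.g. a node with no CPUs
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Map NUMA node id -> list of CPU numbers; {} if sysfs is unavailable."""
    nodes = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*")):
        match = re.search(r"node(\d+)$", path)
        cpulist = os.path.join(path, "cpulist")
        if match and os.path.isfile(cpulist):
            with open(cpulist) as f:
                nodes[int(match.group(1))] = parse_cpulist(f.read())
    return nodes


if __name__ == "__main__":
    for node, cpus in sorted(numa_nodes().items()):
        print(f"node {node}: {len(cpus)} CPUs -> {cpus}")
```

On a 2P system like the one described above, one would expect four nodes to be reported, one per die, although the exact CPU numbering depends on the platform and BIOS.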
Figure 3.
With so many cores needing quick access to data, especially in a multi-processor configuration, the interconnect needed to be equal to the task. AMD improved many features seen on previous products to ensure that all the cores can quickly get the data they need. We already mentioned HyperTransport 3.0, which increased from 4.8 giga-transfers per second (GT/s) on the six-core processors code-named “Istanbul” to 6.4 GT/s, for an effective bandwidth of 25.6 GB/s per processor. In addition, the fourth HT link helps decrease the latency of node-to-node memory traffic by reducing the number of hops between nodes.
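The effect of an extra link on hop count can be illustrated with a small graph model. The sketch below compares the fully connected four-node 2P layout described for Figure 3 with a hypothetical ring in which each node has one fewer link to its peers; the ring is invented purely for comparison and does not represent any shipping AMD topology.

```python
from collections import deque


def max_hops(links):
    """Largest shortest-path hop count between any two reachable nodes."""
    worst = 0
    for start in links:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search from 'start'
            node = queue.popleft()
            for neighbor in links[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        worst = max(worst, max(dist.values()))
    return worst


# 2P layout as described in the text: P0..P3 each link to all three others.
fully_connected = {n: [m for m in range(4) if m != n] for n in range(4)}

# Hypothetical comparison: the same four nodes wired as a ring.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

print("fully connected, worst case hops:", max_hops(fully_connected))
print("ring, worst case hops:           ", max_hops(ring))
```

In the fully connected case every remote access is a single hop, whereas the ring needs two hops to reach the opposite node, which is the kind of latency penalty the additional link avoids.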
In multi-processor setups, a major scalability problem to overcome is the memory bandwidth and latency bottleneck caused by cache coherency traffic. Specifically, with each memory access, cache coherency requirements cause cache probes to be broadcast to each processor in the system, a potentially time-consuming process that can leave a processor core idle. With so much cache probe traffic being passed around, a system's communication network can get bogged down: memory bandwidth is reduced and average latency increases. With “Magny-Cours”, however, a feature called HT Assist (also known as a probe filter) is implemented to address this scalability problem. HT Assist reduces the number of probe requests that are sent out by maintaining a cache directory in a portion of each node's L3 cache (the size of the portion is configurable, but it would typically be 1 MB). HT Assist sharply reduces the number of requests that have to go outside of the local node. This results in additional bandwidth for other requests as well as lower latency for most memory accesses.
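The difference between broadcasting probes and consulting a directory can be shown with a toy model. The sketch below is not AMD's implementation: it treats every access as a read that misses in the local cache, ignores writes, evictions, and directory capacity, and uses an invented access trace. It only counts how many probes each policy would send.

```python
class ProbeFilterModel:
    """Toy directory: remembers which nodes may hold a copy of each line."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.sharers = {}      # cache-line address -> set of node ids
        self.probes_sent = 0

    def read(self, node, line, use_filter):
        others = set(range(self.num_nodes)) - {node}
        if use_filter:
            # Probe only the nodes the directory lists as possible holders.
            targets = self.sharers.get(line, set()) & others
        else:
            # Broadcast: probe every other node on every access.
            targets = others
        self.probes_sent += len(targets)
        self.sharers.setdefault(line, set()).add(node)


def count_probes(accesses, num_nodes, use_filter):
    model = ProbeFilterModel(num_nodes)
    for node, line in accesses:
        model.read(node, line, use_filter)
    return model.probes_sent


# Hypothetical trace: each of 4 nodes touches 5 private lines, then
# nodes 0 and 1 both read one shared line (address 100).
trace = [(n, n * 10 + i) for n in range(4) for i in range(5)]
trace += [(0, 100), (1, 100)]

print("broadcast probes:", count_probes(trace, 4, use_filter=False))
print("filtered probes: ", count_probes(trace, 4, use_filter=True))
```

With this trace the broadcast policy sends 66 probes and the directory policy sends 1, because only the second read of the shared line needs to consult another node. Real workloads share far more data, so the actual savings depend on the access pattern.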
Also fundamental in the “Maranello” platform is an I/O virtualization feature known as IOMMU. This is a chipset feature that provides significant speedup for I/O functions in virtual machines among other benefits. For example, without an IOMMU, a VM guest cannot directly access an I/O device because it could potentially program the device in such a way that the device could corrupt other guests’ memory. IOMMU allows I/O devices to be assigned directly to a given VM guest by limiting memory addresses that devices can access to only those that belong to this guest. Regarding performance, without an IOMMU, hypervisors must emulate devices, which causes significant performance degradation. Therefore, support for an IOMMU by hypervisors can increase performance, reliability, and security. The end result is that virtual machines start to perform more and more like actual native machines.
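The protection idea can be sketched as a per-device allow-list of memory pages. The model below is a deliberate simplification: a real IOMMU also translates device addresses through page tables, and the class name ToyIommu, the device name nic0, and the page ranges are all hypothetical.

```python
PAGE_SIZE = 4096


class ToyIommu:
    """Per-device table of pages a device may touch via DMA (toy model)."""

    def __init__(self):
        self.allowed = {}  # device name -> set of permitted page numbers

    def assign(self, device, guest_pages):
        """Restrict a device to exactly the pages owned by one guest."""
        self.allowed[device] = set(guest_pages)

    def dma(self, device, address):
        """Return True if the access is permitted, False if it is blocked."""
        return address // PAGE_SIZE in self.allowed.get(device, set())


iommu = ToyIommu()
iommu.assign("nic0", range(256, 512))  # hypothetical guest owns pages 256..511

print(iommu.dma("nic0", 300 * PAGE_SIZE))  # inside the guest's memory
print(iommu.dma("nic0", 600 * PAGE_SIZE))  # outside it, so blocked
```

Because the check happens in hardware on every device access, the hypervisor can hand the device to the guest directly instead of emulating it.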
Some of the AMD-P 2.0 power-saving features include AMD CoolSpeed Technology, which reduces P-states when a temperature limit is reached, and C1E, a sleep state invoked when all processor cores in a system are idle. These have significant end-user benefits, discussed in the next section.
End-User Benefits
AMD CoolSpeed technology allows greater platform efficiency through safe fan speed reduction, and allows continued operation when the thermal environment exceeds its operational limits. In the C1E sleep state (entered when all cores in a system are idle), the Northbridge and HyperTransport links are powered down, adding up to significant power savings in the datacenter. In general, the Average CPU Power (ACP) of the AMD Opteron 6100 Series processors is the same as that of previous Opteron processors using Socket F (1207). So a 12-core AMD Opteron 6100 Series processor has twice as many cores as an older 6-core part, yet consumes roughly the same amount of power.
Perhaps most importantly, one of the hallmarks of AMD's development philosophy has been to maintain compatibility across multiple product releases, driving down costs for OEMs and end-users alike. The “Maranello” Socket G34 platform is no different in this regard. AMD plans to design future products around this same socket, providing important cost savings to customers by reducing the number of redesigns required in the future. Platform longevity continues to be a top priority for AMD.
Finally, the combination of up to 12 physical cores per processor and high memory bandwidth delivers more performance and is a great match for applications that demand more and more data. Very high FLOP rates and STREAM memory bandwidth of 100 GB/s or higher are now possible with AMD Opteron 6100 Series processors.
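To put the 100 GB/s STREAM figure in context, the theoretical peak memory bandwidth can be estimated from the channel count and memory speed. The sketch below assumes 64-bit (8-byte) DDR3 channels running at 1333 MT/s and a 4P system; the text above does not say which configuration reached 100 GB/s, so the final ratio is only indicative.

```python
def ddr3_peak_gb_s(mega_transfers_per_s, channels, bus_bytes=8):
    """Theoretical peak bandwidth in GB/s for a set of DDR3 channels."""
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9


per_processor = ddr3_peak_gb_s(1333, channels=4)  # 4 channels per processor
four_socket = 4 * per_processor                   # assumed 4P system
stream_fraction = 100 / four_socket               # 100 GB/s vs. that peak

print(f"peak per processor: {per_processor:.1f} GB/s")
print(f"peak for 4P system: {four_socket:.1f} GB/s")
print(f"100 GB/s is {stream_fraction:.0%} of the 4P peak")
```

Under these assumptions each processor peaks at about 42.7 GB/s and a 4P system at about 170.6 GB/s, so a sustained 100 GB/s would be roughly 59% of the theoretical maximum.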
Enabling Developers
At this point you are likely wondering how to start working effectively with the new capabilities of the “Maranello” platform for your business's needs. How do you take advantage of 12 highly connected cores? Or possibly up to 48 cores if you're dealing with a 4P “Magny-Cours” setup? Fortunately, AMD has been working on tools to target these issues for years. It's important to note that the microarchitecture of “Magny-Cours” is AMD Family 10h; this family also includes earlier products code-named “Istanbul”, “Shanghai”, and “Barcelona”. This means the underlying core microarchitecture is largely the same, so code that is tuned for performance on one Family 10h core performs equally well or better on another. AMD has a multitude of tools already developed and tuned for those products, and they work just as effectively for “Magny-Cours” as for prior Family 10h products. Not only are there tools for squeezing the most performance out of the Family 10h microarchitecture, but there are also tools dedicated to exploiting today's highly parallel multi-core systems such as “Magny-Cours”. Let's take a look at a few.
First up is a tool called CodeAnalyst Performance Analyzer. CodeAnalyst is a suite of tools that is supported on both Windows and Linux platforms. It helps analyze software performance on AMD microprocessors, with the latest release supporting AMD processors through “Magny-Cours”. With CodeAnalyst, software developers can identify performance bottlenecks and get visibility into overall system performance. The tool provides various levels of granularity for viewing application performance, allowing you to drill down into software layers as deep as necessary to find problem areas or "hot spots" that negatively affect code performance. Typical usage of CodeAnalyst involves a four-step process: measure performance, identify "hot spots", identify the cause, and change the program. Several iterations through this sequence are typically necessary to peel away the layers of performance problems, but the final results are worth the effort.
CodeAnalyst includes both a GUI and a command-line interface. The GUI provides a variety of graphs and charts to help decipher the results of the application analysis. For example, CodeAnalyst can give details about the data cache hit rate and help you find the areas of your software that exhibit the most cache misses. Correcting such problems is critical for any enterprise application running in a multiprocessor, multicore environment. Further, thread profiling is a feature of CodeAnalyst for Windows that helps the user understand core utilization and the extent to which threads can execute in parallel. It also points out remote memory accesses, a performance penalty ideally avoided in a NUMA environment such as the “Maranello” platform. CodeAnalyst is offered on AMD Developer Central at no charge and will help bring out the best from your “Magny-Cours” system.
As any developer knows, leveraging code libraries is essential to building large-scale applications effectively. AMD provides a number of libraries, called the AMD Performance Libraries, that are tuned for performance on AMD Family 10h processors. Among these is the AMD Core Math Library (ACML). ACML provides a variety of math functions tuned for top performance on AMD architectures, such as Basic Linear Algebra Subroutines, Fast Fourier Transform routines, array math transcendentals, and random number generators. ACML is integral to HPC, scientific, engineering, and related compute-intensive applications. Visit AMD Developer Central for more details on ACML, including supported compilers for Windows and Linux. Other AMD Performance Libraries of potential interest to “Magny-Cours” users include the AMD String Library, which provides string functions optimized for AMD systems.
The x86 Open64 Compiler Suite is another initiative that AMD spearheaded to better meet developer and end-user needs in the Linux compiler realm. The latest release of the Open64 Compiler Suite includes improvements targeted at code generation for highly parallel systems like “Magny-Cours”. In addition, Open64 supports OpenMP 2.5, which in turn supports shared-memory parallel programming in C/C++ and Fortran. OpenMP gives developers a simple and flexible interface for developing parallel applications that can take full advantage of a product like the 12-core “Magny-Cours” in a multiprocessor system. This interface allows developers to write parallel code through a set of compiler directives, library routines, and environment variables. Perhaps best of all, the Open64 Compiler Suite has a large following in the form of a vibrant online community providing support, tips, and know-how for developers of all skill levels. Again, AMD Developer Central is the best source for additional information about Open64 and its many capabilities.
AMD has collaborated with other partners over the years to tune their compiler tools for AMD Family 10h processors; these include Microsoft Visual Studio, PGI Compilers and Tools, Sun Studio Compilers, and GNU Compilers, among others.
This is just a small sampling of the varied tools that AMD has helped develop over recent years in order to enable end-users to take full advantage of the technologies inside their Family 10h processors. Stay tuned to the “Magny-Cours” Zone on AMD Developer Central for the latest tips for exploiting the capabilities of “Magny-Cours”.
Conclusion
The “Magny-Cours” processor with Direct Connect Architecture 2.0 comes with a bevy of new and improved technologies that allow more processor cores to communicate with each other, and with the rest of the platform, faster and more efficiently than ever before. Users will immediately see improvements in raw performance, virtualization, and performance-per-watt over previous AMD products with this new platform. Software developers can leverage the free resources offered on developer.amd.com to take advantage of the increased core count and the new and improved software-visible features.