The AMD processor formerly codenamed “Magny-Cours” (part of the Family 10h processor family) introduces key technology advancements that build on the foundation laid by the preceding processors, formerly codenamed AMD “Barcelona”, “Shanghai” and “Istanbul”. With “Barcelona,” we introduced an array of innovations in processor design and features, including a native quad-core architecture and a new L3 cache shared across the processor cores. The AMD “Shanghai” release brought additional enhancements, including improved scalability and availability and a larger L3 cache. The AMD “Istanbul” processor provided further enhancements for software developers: an even larger shared L3 cache, a total of six physical cores on die, a new probe filter called HT Assist to help increase bandwidth, several new power features, and I/O virtualization. “Magny-Cours” adds even more cores, for a total of up to 12 cores per processor, and enhances features such as power, virtualization and Direct Connect Architecture.
There are a number of software-visible features that can be leveraged to make your applications perform better and be ready to scale across multiple cores. Visit this page regularly for updated information and practical guidance on how to take advantage of all the new features in the latest Family 10h processors.
Software Development Tools and Resources
The following software development tools and resources have been optimized for Family 10h processors:
ACML is specifically designed to support multi-threading and other key features of AMD’s next-generation processors. ACML currently supports OpenMP, and features hand-tuned “Barcelona”, “Shanghai”, “Istanbul” and “Magny-Cours” support for BLAS matrix multiplication routines and the CFFT complex-complex Fast Fourier Transforms. The newly released ACML 4.4.0 includes further tuning of ZGEMM and real-complex FFTs.
The GNU Toolset, including the GCC compiler, the glibc project, and the binutils, has been optimized for AMD Family 10h processors.
The Visual Studio 2008 tools feature improved instruction selection, optimized register allocation, and enhanced 128-bit floating-point performance when used with AMD Family 10h processors.
The x86 Open64 compiler system is a high-performance, production-quality code generation tool designed for high-performance parallel computing workloads. The x86 Open64 environment gives the developer the essential choices when building and optimizing C, C++, and Fortran applications targeting 32-bit and 64-bit Linux platforms.
Overview of Software Visible Features
Previous new feature flags for Family 10h functions:
- Fire & forget dynamic O/S P-state support
- Misaligned SSE access
- OS Visible workaround register
- Instruction-based sampling
- SVM lock
- Nested Paging
- L3 cache size
- 128-bit FPU
Feature identification bits for new instructions
- SSE4a Instructions
- Compiler Options Quick Reference Guide for AMD Opteron™ 6100 Series Processors (“Magny-Cours”) and AMD Opteron™ 4100 Series Processors (“Lisbon”)
- Software Optimization Guide for AMD Family 10h Processors
- Compiler Usage Guidelines for AMD64 Platforms
- CPUID Specification
- Revision Guide for AMD Family 10h Processors
- BIOS and Kernel Developer’s Guide (BKDG) For AMD Family 10h Processors
- “Bulldozer” and “Piledriver” Instructions Guide
- AMD64 Architecture Programmer’s Manual Volume 1: Application Programming
- AMD64 Architecture Programmer’s Manual Volume 2: System Programming
- AMD64 Architecture Programmer’s Manual Volume 3: General-Purpose and System Instructions
- AMD64 Architecture Programmer’s Manual Volume 4: 128-Bit Media Instructions
- AMD64 Architecture Programmer’s Manual Volume 5: 64-Bit Media and x87 Floating-Point Instructions
- See also the AMD Opteron 6100 video series for videos on performance, power, virtualization, and more.
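Several of the feature flags listed under Overview of Software Visible Features can be tested at run time with the CPUID instruction. The following is a minimal sketch in C, not an official AMD sample. It assumes the bit positions documented in the CPUID Specification for function 8000_0001h ECX (SVM bit 2, ABM bit 5, SSE4A bit 6, MisAlignSse bit 7, OSVW bit 9, IBS bit 10) and the L3 size field in function 8000_0006h EDX[31:18], reported in 512 KB units. The names amd_has_feature, amd_l3_size_kb, amd_read_ext_features and the AMD_FEAT_* constants are hypothetical helpers introduced for illustration; __get_cpuid comes from the cpuid.h header shipped with GCC and Clang.

```c
#include <stdbool.h>
#include <stdint.h>

/* cpuid.h is only available with GCC/Clang on x86 targets. */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define HAVE_CPUID_H 1
#endif

/* Bit masks for CPUID Fn8000_0001 ECX (hypothetical names). */
enum {
    AMD_FEAT_SVM         = 1u << 2,  /* secure virtual machine          */
    AMD_FEAT_ABM         = 1u << 5,  /* advanced bit manipulation       */
    AMD_FEAT_SSE4A       = 1u << 6,  /* SSE4a instructions              */
    AMD_FEAT_MISALIGNSSE = 1u << 7,  /* misaligned SSE access           */
    AMD_FEAT_OSVW        = 1u << 9,  /* OS visible workaround register  */
    AMD_FEAT_IBS         = 1u << 10  /* instruction-based sampling      */
};

/* Returns true if every bit in mask is set in the given ECX value. */
bool amd_has_feature(uint32_t ecx, uint32_t mask)
{
    return (ecx & mask) == mask;
}

/* Decodes the L3 size in KB from a CPUID Fn8000_0006 EDX value.
 * EDX[31:18] holds the size in units of 512 KB. */
uint32_t amd_l3_size_kb(uint32_t edx)
{
    return ((edx >> 18) & 0x3FFFu) * 512u;
}

/* Reads CPUID Fn8000_0001 ECX into *ecx_out.
 * Returns false when the leaf (or CPUID itself) is unavailable. */
bool amd_read_ext_features(uint32_t *ecx_out)
{
#ifdef HAVE_CPUID_H
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return false;
    *ecx_out = ecx;
    return true;
#else
    (void)ecx_out;
    return false;
#endif
}
```

A caller would typically invoke amd_read_ext_features once at startup and select an SSE4a or ABM code path only when amd_has_feature confirms the corresponding bit. Keeping the decoding separate from the CPUID read lets the bit logic be checked on any machine.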
Technical Articles & Blogs
There are several new features in power and virtualization, but the most prominent new feature is the increase in cores to 8 and 12 on each processor made possible by our Direct Connect Architecture. This technical article outlines what enhancements were made and how they will benefit your code.
Five years ago, AMD shook up the x86 processor by putting a memory controller directly on-chip. Now, AMD breaks new ground again with an innovative cache strategy.
New features in AMD’s upcoming Barcelona chip dramatically boost performance of floating-point arithmetic and greatly accelerate access to cache.
Take advantage of the many architectural innovations in the “Barcelona” processor through Orcas-based tools and AMD libraries.
AMD (Family 10h) Processor Software Visible Features blog series
Previous “Istanbul” blogs
Previous “Shanghai” blogs
- Transition from “Barcelona” to “Shanghai”
- Larger L3 Cache
- Improved Reliability, Availability, Scalability
Previous “Barcelona” blogs
- Shared L3 Cache
- Instruction-Based Sampling (IBS)
- SSE Misaligned Access
- SSE4a Instruction Set, Part 1
- SSE4a Instruction Set, Part 2
- Sideband Stack Optimizer
- 128-bit FPU
- Advanced Bit Manipulation (ABM)
Benchmarks and Performance Evaluations
This VMware performance white paper evaluating RVI performance with the Shanghai processor concludes that “the current VMware VMM leverages these features quite well, resulting in performance gains of up to 42% for MMU-intensive benchmarks and up to 500% for MMU-intensive microbenchmarks.”
HP ProLiant DL585 G5 earns #1 virtualization performance record on VMmark benchmark.
The very first independent Nested Paging virtualization tests (2-socket servers running Xen with database and web serving workloads and featuring AMD-V (RVI)).
“Jaguar,” the AMD Opteron-based system by Cray at Oak Ridge National Labs, is the first entirely x86-based system to break the Petaflop barrier.
HP ProLiant DL585 G5 and DL385 G5 AMD Opteron servers lead with 4P, 2P world record performances on the SPECweb®2005 Benchmark.
(Please note that Dual-Core AMD Opteron processors also hold the SPECweb2005 performance records for 2P and 4P servers.)
An 8-socket Shanghai-based HP system achieves the top x86-based score with Oracle, and a 2-socket Shanghai-based HP system achieves the top x86-based score with SQL Server 2005.
AnandTech is “quite surprised that Shanghai was able to meet and, in some cases, pass Harpertown at various workload levels in some of the benchmarks.”
HP ProLiant DL585 G5 with Quad-Core AMD Opteron processors takes #1 4-socket worldwide price/performance record again on TPC-C benchmark.
HP ProLiant BL465c G5 server blade posts HP’s first Quad-Core AMD Opteron™ blade result on Oracle Applications Standard Benchmark (small model, single DB instance).
HP ProLiant DL585 G5 achieves #1 4-processor Windows result on two-tier SAP® Sales and Distribution Standard Application Benchmark.
HP ProLiant DL785 G5 takes #1 8-processor Windows result with new Quad-Core AMD Opteron™ processors on two-tier SAP® Sales and Distribution Standard Application Benchmark.
HP ProLiant servers show excellent performance scalability with new Quad-Core AMD Opteron processors on two-tier SAP® Sales and Distribution (SD) Standard Application Benchmark (2-socket and 4-socket blades and servers).
Java Application Serving
Quad-Core AMD Opteron processor-based Sun X4600 server sets x86 SPECjbb2005 world record (8-socket server).
Floating Point Performance
HP ProLiant DL585 G5 server with latest Quad-Core AMD Opteron™ processors takes overall x86_64 records on SPEC® CPU2006 benchmark.