Direct Connect Architecture
AMD64 with Direct Connect Architecture can improve overall system performance and efficiency by eliminating traditional bottlenecks inherent in legacy architectures.
Legacy front-side bus architectures restrict and interrupt the flow of data. Slower data flow means slower system performance. Interrupted data flow means reduced system scalability.
With Direct Connect Architecture, there are no front-side buses. Instead, the memory controller, I/O, and other processors are connected directly to each CPU and communicate at CPU speed.
Direct Connect Architecture is available only with AMD64 processors, including AMD Opteron, AMD Athlon, and AMD Turion™ processors.
Exclusive features of Direct Connect Architecture include:
Integrated memory controller
AMD64 processors with Direct Connect Architecture feature an integrated, on-die memory controller, optimizing memory performance and bandwidth per CPU. AMD's memory bandwidth scales with the number of processors, compared to legacy designs that scale poorly because access to main memory is limited by external Northbridge chips.
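The scaling argument above can be sketched as a simple idealized model. The function name and the per-controller bandwidth value are hypothetical placeholders chosen for illustration, not AMD figures, and the model ignores the cost of reaching memory attached to another processor.

```python
def aggregate_memory_bandwidth_gbs(sockets, per_controller_gbs, integrated=True):
    """Idealized peak memory bandwidth of the whole system, in GB/s."""
    if sockets < 1:
        raise ValueError("sockets must be at least 1")
    if integrated:
        # One on-die memory controller per processor: bandwidth adds up.
        return sockets * per_controller_gbs
    # One external Northbridge shared by all processors: total stays fixed.
    return per_controller_gbs


PER_CONTROLLER_GBS = 10.0  # hypothetical placeholder, not an AMD figure

for sockets in (1, 2, 4):
    integrated = aggregate_memory_bandwidth_gbs(sockets, PER_CONTROLLER_GBS)
    shared = aggregate_memory_bandwidth_gbs(
        sockets, PER_CONTROLLER_GBS, integrated=False
    )
    print(f"{sockets} socket(s): integrated total {integrated:.1f} GB/s, "
          f"shared total {shared:.1f} GB/s")
```

In this model the shared design leaves each processor with a shrinking share of a fixed total, while the integrated design keeps the per-processor share constant.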
HyperTransport technology
HyperTransport technology is a high-speed, bi-directional, low latency, point-to-point communication link that provides a scalable bandwidth interconnect between computing cores, I/O subsystems, and other chipsets. AMD Opteron processors support up to three coherent HyperTransport links, yielding up to 24.0 GB/s peak bandwidth per processor.
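One way to arrive at the 24.0 GB/s figure is shown below. The text states only the link count and the total. The per-link parameters (a 1 GHz link clock, double data rate signaling, 16-bit links, and counting both directions) are assumptions that happen to reproduce it, and the function name is hypothetical.

```python
def ht_link_bandwidth_gbs(clock_ghz=1.0, transfers_per_clock=2,
                          width_bits=16, directions=2):
    """Peak bandwidth of one HyperTransport link in GB/s.

    All default values are assumptions, not taken from the text.
    """
    bytes_per_transfer = width_bits / 8
    return clock_ghz * transfers_per_clock * bytes_per_transfer * directions


COHERENT_LINKS = 3  # "up to three coherent HyperTransport links"

per_link = ht_link_bandwidth_gbs()
per_processor = COHERENT_LINKS * per_link
print(f"per link: {per_link:.1f} GB/s, per processor: {per_processor:.1f} GB/s")
# prints "per link: 8.0 GB/s, per processor: 24.0 GB/s"
```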
Third-Generation AMD Opteron vs. Second-Generation AMD Opteron Architecture
The architectural features of Third-Generation AMD Opteron that primarily affect raw performance are, in order of importance to HPC: the AMD Wide Floating Point Accelerator, AMD Memory Optimizer Technology, and the AMD Balanced Smart Cache.
Taking advantage of these capabilities requires newer compilers and libraries that are optimized for the enhanced micro-architecture of Third-Generation AMD Opteron. AMD Phenom processors share this micro-architecture, since both Phenom and Opteron are AMD Family 10h processors, so the same tools also apply to Phenom.
AMD Wide Floating Point Accelerator
The 128-bit SSE floating-point capabilities enable each processor to simultaneously execute up to four flops per clock per core (up to four times the floating-point computations of previous AMD Opteron processors). Instruction fetch bandwidth, data-cache bandwidth, and memory-controller-to-cache bandwidth have all been doubled over previous AMD Opteron processors to help keep the 128-bit floating-point pipeline full.
| Feature | Second-Generation Opteron (all Family 0Fh processors) | Third-Generation Opteron (all Family 10h processors) |
|---|---|---|
| SSE Execution Width | 64-bit | 128-bit + SSE MOVs |
| Instruction Fetch Bandwidth | 16 bytes/cycle | 32 bytes/cycle + unaligned load ops |
| Data Cache Bandwidth | 2 x 64-bit loads/cycle | 2 x 128-bit loads/cycle |
| L2/Northbridge Bandwidth | 64 bits/cycle | 128 bits/cycle |
| FP Scheduler Depth | 36 dedicated x 64-bit ops | 36 dedicated x 128-bit ops |
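As a rough illustration of what the wider SSE path means for peak throughput, the sketch below multiplies cores, clock rate, and flops per clock. The four flops per clock per core comes from the text. The core counts (quad-core versus dual-core), the 2.0 GHz clock, and the two flops per clock for Family 0Fh (inferred from the halved SSE width in the table above) are assumptions for illustration only.

```python
def peak_gflops(cores, clock_ghz, flops_per_clock_per_core):
    """Theoretical peak floating-point rate of one processor, in GFLOPS."""
    return cores * clock_ghz * flops_per_clock_per_core


CLOCK_GHZ = 2.0  # hypothetical clock, the same for both generations

# Assumed: quad-core Family 10h part, four flops per clock per core (from the text).
third_gen = peak_gflops(cores=4, clock_ghz=CLOCK_GHZ, flops_per_clock_per_core=4)

# Assumed: dual-core Family 0Fh part, two flops per clock per core (64-bit SSE path).
second_gen = peak_gflops(cores=2, clock_ghz=CLOCK_GHZ, flops_per_clock_per_core=2)

print(f"third generation:  {third_gen:.1f} GFLOPS")
print(f"second generation: {second_gen:.1f} GFLOPS")
print(f"ratio: {third_gen / second_gen:.1f}x")
```

Under these assumptions the per-processor ratio is 4x: a factor of two from the wider floating-point unit and a factor of two from the doubled core count. Sustained performance depends on how well compilers and libraries keep the pipeline full.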
Through enablement work with AMD technology partners, there are a variety of compilers already available to take advantage of these features. These include the version 7 and later PGI compiler, the Sun Studio 12 patches to the Sun compiler tools, and the GCC 4.1.2 compiler chain, as well as GCC 4.2.
Math libraries are the next primary consideration. AMD Core Math Library (ACML) Version 4 and later versions are tuned for these features.
Integrated DDR2 DRAM Controller with AMD Memory Optimizer Technology
AMD Family 10h processors have a 128-bit memory channel which can be divided into two independent 64-bit memory channels to improve memory access efficiency. The controller has larger memory buffers for increased throughput, and it uses write bursting to minimize switching between reads and writes on the memory bus. In addition, an optimized DRAM paging algorithm and a prefetcher in the memory controller predict and retrieve data needed from main memory ahead of time. Finally, core prefetchers can pull data directly into L1 cache to decrease latency and to spare L2 bandwidth.
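The sketch below shows why splitting the 128-bit channel does not change peak bandwidth: the two 64-bit channels together move the same number of bytes per transfer, but each can service a different request. The DDR2-800 transfer rate is an assumed example (the text says only DDR2), and the function name is hypothetical.

```python
import math


def channel_bandwidth_gbs(transfers_per_sec_millions, width_bits):
    """Peak bandwidth of one DRAM channel in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = width_bits // 8
    return transfers_per_sec_millions * bytes_per_transfer / 1000


DDR2_MTS = 800  # assumed DDR2-800 modules; not stated in the text

single_wide = channel_bandwidth_gbs(DDR2_MTS, 128)    # one 128-bit channel
two_narrow = 2 * channel_bandwidth_gbs(DDR2_MTS, 64)  # two independent 64-bit channels

print(f"one 128-bit channel: {single_wide:.1f} GB/s")
print(f"two 64-bit channels: {two_narrow:.1f} GB/s total")
assert math.isclose(single_wide, two_narrow)
```

The benefit of the split is concurrency rather than raw bandwidth: with several cores issuing unrelated requests, two independent channels can keep more DRAM pages open and more requests in flight.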
AMD Balanced Smart Cache
AMD Family 10h processors have a large shared L3 cache which shares data between cores efficiently while helping reduce latency to main memory. A dedicated L1 and L2 cache per core helps the performance of virtualized environments and large databases by reducing the cache pollution associated with a shared L2 cache. The L1 cache of Third-Generation AMD Opteron processors delivers twice the load bandwidth per cycle of Second-Generation AMD Opteron processors (two 128-bit loads instead of two 64-bit loads) to help keep CPU cores busy.
Opteron's caches are in general exclusive: data stored in L1, L2, and L3 is not duplicated across levels, and hence the total cache size is the sum of all three caches. In addition, the caches use a write-back policy, which reduces bus traffic by not involving the memory bus in exclusive and modified writes or in shared reads. This frees up bandwidth, and thus helps the system scale with additional processors and cores.
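The capacity argument for exclusive caches can be sketched as follows. The cache sizes and core count are assumed example values, not taken from the text, and the strictly inclusive hierarchy is a hypothetical comparison point in which the last-level cache holds a copy of everything in the levels above it.

```python
def unique_cache_capacity_kib(l1_kib, l2_kib, l3_kib, cores, exclusive=True):
    """Unique data capacity of the cache hierarchy in KiB.

    L1 and L2 are per core; L3 is shared by all cores.
    """
    if exclusive:
        # No line is held in more than one level, so capacities add up.
        return cores * (l1_kib + l2_kib) + l3_kib
    # Strictly inclusive: L3 duplicates all L1/L2 contents, so only L3 counts.
    return l3_kib


# Assumed example sizes: 64 KiB L1 data, 512 KiB L2 per core, 2048 KiB shared L3.
SIZES = dict(l1_kib=64, l2_kib=512, l3_kib=2048, cores=4)

print("exclusive:", unique_cache_capacity_kib(**SIZES), "KiB")
print("inclusive:", unique_cache_capacity_kib(**SIZES, exclusive=False), "KiB")
# prints "exclusive: 4352 KiB" then "inclusive: 2048 KiB"
```

Because the text describes the caches as exclusive only "in general", the exclusive figure is an upper bound on unique capacity rather than a guarantee.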