AMD Developer Central

Hardware for Parallel Computing
Overview
This section covers considerations for High Performance Computing (HPC) software development, aimed at developers who program, build, and use software on AMD-based systems.

First, we'll cover the basics of general parallel programming models and the systems designed to support them, and touch briefly on the tools space. Then we'll go into some detail on AMD's architecture and how it supports HPC workloads.

This provides background for the Tutorials, which give hands-on instruction on installing the tools and libraries used to illustrate core programming practices for taking advantage of AMD's architecture. The Tutorials also develop and expand upon these core practices, with topics including memory management, code generation, NUMA optimization, file I/O, and networking.

» CPU Evolution and Performance Considerations
» Processor versus Socket
» SMP versus SMP versus NUMA
» Clusters versus Grids
» Network Interconnects
» Remote Data Shared Memory Architectures
» AMD Architectural Solutions

CPU Evolution and Performance Considerations
The evolution of general-purpose CPUs has been strongly influenced by the need for efficient floating-point computation in High Performance Computing and supercomputing. This has meant support for vector (SIMD) instruction set extensions, a form of data-level parallelism, together with superscalar CPU architectures, which exploit instruction-level parallelism by executing more than one instruction per machine cycle. From the programmer's standpoint, a superscalar architecture is most useful when paired with a super-pipelined, out-of-order processor (such as AMD Opteron or AMD Athlon™). This allows the hardware to schedule micro-operations optimally and improve efficiency with less burden on the programmer. Note, however, that peak performance is easiest to achieve when the programmer contributes domain knowledge and assistance.

The 1990s saw the widespread adoption of x86-based multi-CPU computing, first in servers and workstations. In the 2000s it moved into the home personal computer, first with high-end dual-socket PCs and finally with multi-core processors. In all of these cases, the programming challenges for parallelization remain essentially the same. The operating system challenges for SMP are the same for a four-socket single-core server as for a single-socket four-core desktop. However, the performance of these different types of systems is subject to a number of complications.

One complication for performance is that computation has always been bottlenecked by memory latency. This is especially painful in HPC because that audience is typically interested in double-precision floating-point calculations that require not just a fast FPU but also the lowest possible latency to main memory. Latency can be improved by providing on-die or off-die CPU caches, a faster system bus, or, in AMD's approach, an integrated on-die memory controller. On SMP systems (multiple CPUs in a single system), memory latency and bandwidth are an additional challenge because in many cases one or two CPUs or cores can saturate the available memory bandwidth.

Another way to address the memory latency challenge is to use multiple memory systems and multiple memory interfaces; that is, to distribute computation across multiple SMP systems. This means distributing calculations and data to remote systems through an interface such as MPI, taking advantage of those remote systems' CPU and memory bandwidth, and then pulling the results back in. At that point, computational efficiency becomes tied to the performance of the network interconnects used.
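The distribute, compute, and pull-back pattern described above can be sketched without any MPI installation. The following is a minimal illustration that simulates the "ranks" inside a single Python process; it is not MPI code, and the names `split_evenly`, `local_work`, and `distributed_sum` are hypothetical helpers introduced only for this example.

```python
# Illustrative sketch only: simulates MPI-style ranks inside one process.
# split_evenly, local_work, and distributed_sum are hypothetical names.

def split_evenly(data, n_ranks):
    """Split data into n_ranks contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(data), n_ranks)
    chunks, start = [], 0
    for rank in range(n_ranks):
        size = base + (1 if rank < extra else 0)
        chunks.append(data[start:start + size])
        start += size
    return chunks


def local_work(chunk):
    """Work each simulated rank performs on its own chunk (here, a sum of squares)."""
    return sum(x * x for x in chunk)


def distributed_sum(data, n_ranks):
    """Scatter the data, compute partial results per rank, then reduce to one value."""
    partials = [local_work(chunk) for chunk in split_evenly(data, n_ranks)]
    return sum(partials)


if __name__ == "__main__":
    values = list(range(10))
    print(distributed_sum(values, 4))
```

In a real cluster each chunk would travel over the interconnect to a remote system, which is why the cost of the scatter and reduce steps depends on network performance.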

Below, we'll elaborate on some of the AMD architectural innovations that provide solutions to bottlenecks in HPC and in general purpose computing. Before that, however, we'll spend a little more time on other concepts and basic terminology for HPC hardware.


Processor versus Socket
A processor or CPU is something that a processor manufacturer sells as a discrete “part.” A socket is simply a connection on a motherboard that a processor is plugged into. Usually, the number of processors and the number of sockets are the same when an OEM sells a computer system. Confusion arises when someone asks “how many CPUs?”, for which the best answer might really be the total number of cores: say, 32 for an eight-socket system with eight Quad-Core AMD Opteron processors. This is because, for historical reasons, an OS or application maps each core to its concept of a CPU, and there's no reason to change that concept. However, in this case, from the processor manufacturer's standpoint, the answer is eight.
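The arithmetic behind the two answers can be made concrete. This small sketch computes the OS-visible CPU count from the socket and core counts in the example above; `logical_cpus` is a hypothetical helper, and the value printed by `os.cpu_count()` depends entirely on the machine the script runs on (it may also count hardware threads, and can be `None` if undetermined).

```python
import os


def logical_cpus(sockets, cores_per_socket):
    """Number of CPUs the OS would see if each core is presented as one CPU."""
    return sockets * cores_per_socket


if __name__ == "__main__":
    # Eight sockets, each holding one quad-core processor.
    print("processors (manufacturer's view):", 8)
    print("CPUs (OS view):", logical_cpus(8, 4))
    # What the OS on this particular machine reports.
    print("os.cpu_count() here:", os.cpu_count())
```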

SMP versus SMP versus NUMA
There have been two common definitions of the SMP acronym. Both can be considered correct depending on the context. The simplest definition, which happens to be generally applicable to AMD-based hardware, is Shared Memory multi-Processing. This means that multiple CPUs (or multiple cores) share access to memory on the same system. The alternative definition is Symmetric Multi-Processing; this also means that multiple CPUs (or multiple cores) share access to memory on the same system, but it additionally means that the speed and latency of each CPU's access to the memory are the same.

AMD-based servers and workstations are Non-Uniform Memory Access (NUMA) systems, which means that the speed and latency of each CPU's access to the memory are not the same. So, “Symmetric” does not apply, but “Shared” does. The AMD approach is detailed further later on. Suffice it to say that many vendors, including IBM, Unisys, and Sun Microsystems, have supplied Shared Memory multi-Processing systems with a NUMA architecture. The term “locality of reference” is relevant to NUMA and to any hierarchical memory system (L1 cache, L2 cache, L3 cache, RAM, and so forth).
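Why locality of reference matters on a NUMA system can be seen with a simple weighted-average model. The sketch below is an illustration under stated assumptions, not a measurement: `average_latency_ns` is a hypothetical helper, and the 100 ns local and 150 ns remote latencies are made-up placeholder values.

```python
def average_latency_ns(local_fraction, local_ns, remote_ns):
    """Average memory latency when local_fraction of accesses hit local memory."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


if __name__ == "__main__":
    # Placeholder latencies, chosen only to illustrate the trend.
    for fraction in (1.0, 0.75, 0.5, 0.0):
        print(fraction, average_latency_ns(fraction, 100.0, 150.0))
```

The more of a thread's data that lives in memory attached to its own processor, the closer the average comes to the local latency.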


Clusters versus Grids
As described above, clusters have come to mean collections of SMP systems tightly networked together. Typically, each system in a cluster is homogeneous (the same) for acquisition and administrative purposes, and this is generally true of HPC clusters. A grid, on the other hand, can according to some definitions consist of heterogeneous systems, in which not only the systems themselves but also the CPUs and operating systems can differ. A good example is SETI@Home, part of the BOINC project, an open-source platform for volunteer-based distributed computing. Commercially, Sun Microsystems supplies the Sun Grid Compute Utility, which provides grid computation on demand, working from a catalog of software applications.

Network Interconnects
Interconnects are fundamentally hardware: network interface cards (NICs), link cables made of copper or optical fiber, switches, hubs, routers, and so on. Network technologies such as InfiniBand, InfiniPath, Myrinet, and of course Ethernet are different implementation approaches to these fundamentals. There are a variety of vendors supplying hardware and software solutions based on these technologies. For example, Scali provides Scali MPI Connect, a tuned MPI library implementation that can work well with the above network technologies. Myrinet is provided by Myricom.

Remote Data Shared Memory Architectures
For the distributed programmer, it would be nice to access remote memory more transparently, that is, something like “malloc” instead of the programming effort of MPI. A technology used to that end on some systems (referred to as NUMA computer clusters) is the Scalable Coherent Interface (SCI). Intended to become a bus standard, it was not widely adopted but there are solution providers such as Dolphin Interconnect Solutions who provide the technology. The key thing to recognize here is this type of technology can in some ways simplify the programming model.

Related to this is the Global Address Space (GAS) approach. This can be deployed in a variety of environments, including on top of MPI (simple message-passing networking) or on a NUMA computer cluster using SCI. The point of GAS is less programming overhead than MPI, but more explicit control and awareness of locality of reference.
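One way to picture the GAS idea is as a single global index range whose elements are spread across nodes, with the programmer able to ask where any element lives. The sketch below assumes a simple block distribution; `owner_and_offset` and `is_local` are hypothetical helpers written for this illustration and do not correspond to the API of any particular GAS library.

```python
def owner_and_offset(global_index, n_elements, n_nodes):
    """Map a global index to (owning node, local offset) under a block distribution."""
    if not 0 <= global_index < n_elements:
        raise IndexError("global index out of range")
    block = -(-n_elements // n_nodes)  # ceiling division: elements per node
    return divmod(global_index, block)


def is_local(global_index, n_elements, n_nodes, my_node):
    """True when the element lives in this node's memory (good locality)."""
    owner, _ = owner_and_offset(global_index, n_elements, n_nodes)
    return owner == my_node


if __name__ == "__main__":
    # Ten elements spread over four nodes: blocks of three elements each.
    for index in (0, 5, 9):
        print(index, owner_and_offset(index, 10, 4))
```

The caller indexes the array globally, much like ordinary memory, but can still check ownership to keep most accesses local.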


AMD Architectural Solutions
Direct Connect Architecture

AMD64 with Direct Connect Architecture can improve overall system performance and efficiency by eliminating traditional bottlenecks inherent in legacy architectures.

Legacy front-side bus architectures restrict and interrupt the flow of data. Slower data flow means slower system performance. Interrupted data flow means reduced system scalability.

With Direct Connect Architecture, there are no front-side buses. Instead, the processors, memory controller, and I/O are directly connected to the CPU and communicate at CPU speed.

Direct Connect Architecture is available only with AMD64 processors, including the AMD Opteron and AMD Athlon processors, as well as with AMD Turion™ processors.

Exclusive features of Direct Connect Architecture include:

Integrated memory controller

AMD64 processors with Direct Connect Architecture feature an integrated, on-die memory controller, optimizing memory performance and bandwidth per CPU. AMD's memory bandwidth scales with the number of processors, compared to legacy designs that scale poorly because access to main memory is limited by external Northbridge chips.

HyperTransport technology

HyperTransport technology is a high-speed, bi-directional, low latency, point-to-point communication link that provides a scalable bandwidth interconnect between computing cores, I/O subsystems, and other chipsets. AMD Opteron processors support up to three coherent HyperTransport links, yielding up to 24.0 GB/s peak bandwidth per processor.
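The peak figure quoted above is a simple product of link count and per-link bandwidth. In the sketch below, the 8.0 GB/s per-link value is an assumption obtained by dividing the stated 24.0 GB/s by the three links; `aggregate_ht_bandwidth_gbs` is a hypothetical helper.

```python
def aggregate_ht_bandwidth_gbs(links, per_link_gbs=8.0):
    """Peak aggregate bandwidth across all coherent HyperTransport links."""
    return links * per_link_gbs


if __name__ == "__main__":
    # per_link_gbs=8.0 is inferred from 24.0 GB/s over three links.
    for links in (1, 2, 3):
        print(links, "link(s):", aggregate_ht_bandwidth_gbs(links), "GB/s peak")
```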

Third-Generation AMD Opteron vs. Second-Generation AMD Opteron Architecture

Listed in order of importance to HPC are the architectural features of Third-Generation AMD Opteron that primarily affect raw performance. These are the AMD Wide Floating Point Accelerator, the AMD Memory Optimizer Technology, and the AMD Balanced Smart Cache.

Taking advantage of these capabilities requires newer compilers and libraries that are optimized for the enhanced micro-architecture of Third-Generation AMD Opteron. Note that AMD Phenom processors have the same micro-architecture, since both Phenom and Opteron are AMD Family 10h processors; those tools therefore also apply to Phenom.

AMD Wide Floating Point Accelerator

The 128-bit SSE floating-point capabilities enable each processor to simultaneously execute up to four flops per clock per core (up to four times the floating-point computations of previous AMD Opteron processors). Instruction fetch bandwidth, data-cache bandwidth, and memory-controller-to-cache bandwidth have all been doubled over previous AMD Opteron processors to help keep the 128-bit floating-point pipeline full.
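The four-flops-per-clock figure translates into a theoretical peak rate once a core count and clock speed are chosen. The sketch below is a back-of-the-envelope calculation only; the 2.0 GHz clock is a hypothetical value picked for illustration, and `peak_gflops` is a helper name introduced here.

```python
def peak_gflops(cores, clock_ghz, flops_per_clock=4):
    """Theoretical peak GFLOPS: cores x clock (GHz) x flops per clock per core."""
    return cores * clock_ghz * flops_per_clock


if __name__ == "__main__":
    # One quad-core processor at an assumed 2.0 GHz.
    print(peak_gflops(4, 2.0))
    # The eight-socket, 32-core system from the earlier example, same assumed clock.
    print(peak_gflops(32, 2.0))
```

Sustained performance is normally lower than this peak, because real code rarely keeps the floating-point pipeline full on every cycle.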

 

| Feature | Second-Generation Opteron (All Family 0Fh Processors) | Third-Generation Opteron (All Family 10h AMD Processors) |
| --- | --- | --- |
| SSE Execution Width | 64-bit | 128-bit + SSE MOVS |
| Instruction Fetch Bandwidth | 16 bytes/cycle | 32 bytes/cycle + Unaligned load ops |
| Data Cache Bandwidth | 2 x 64-bit loads/cycle | 2 x 128-bit loads/cycle |
| L2/Northbridge Bandwidth | 64 bits/cycle | 128 bits/cycle |
| FP Scheduler Depth | 36 Dedicated x 64-bit ops | 36 Dedicated x 128-bit ops |

Through enablement work with AMD technology partners, there are a variety of compilers already available to take advantage of these features. These include the version 7 and later PGI compiler, the Sun Studio 12 patches to the Sun compiler tools, and the GCC 4.1.2 compiler chain, as well as GCC 4.2.

Math libraries are the next primary considerations. AMD Core Math Library (ACML) Version 4 and later versions are tuned for these features.

Integrated DDR2 DRAM Controller with AMD Memory Optimizer Technology

AMD Family 10h processors have a 128-bit memory channel that can be divided into two independent 64-bit memory channels to improve memory access efficiency. The controller has larger memory buffers and uses write bursting to minimize read/write transactions, both of which increase throughput. In addition, an optimized DRAM paging algorithm intelligently predicts and retrieves data needed from main memory. Finally, core prefetchers can pull data directly into L1 cache to decrease latency and to spare L2 bandwidth.

AMD Balanced Smart Cache

AMD Family 10h processors have a large shared L3 cache, which shares data between cores efficiently while helping reduce latency to main memory. A dedicated L1 and L2 cache per core helps the performance of virtualized environments and large databases by reducing the cache pollution associated with a shared L2 cache. The L1 cache of Third-Generation AMD Opteron processors can handle double the number of loads per cycle compared with Second-Generation AMD Opteron processors, helping to keep CPU cores busy.

Opteron's caches are in general exclusive: data stored in L1, L2, and L3 is not duplicated across levels, and hence the effective cache size is the sum of all three caches. In addition, the caches use a “write back” policy, which reduces bus traffic by not involving the memory bus in exclusive and modified writes or in shared reads. This frees up bandwidth, and thus helps the system scale with additional processors and cores.
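The effect of an exclusive hierarchy on capacity can be shown with a short calculation. The cache sizes below (64 KB L1, 512 KB L2, 2048 KB L3) are placeholder values assumed for illustration, not figures taken from this article, and both function names are hypothetical. The inclusive case is included only as a contrast and assumes every inner level is fully duplicated in the largest level.

```python
def exclusive_capacity_kb(l1_kb, l2_kb, l3_kb):
    """Unique data one core can cache when no level duplicates another."""
    return l1_kb + l2_kb + l3_kb


def inclusive_capacity_kb(l1_kb, l2_kb, l3_kb):
    """Contrast case: inner levels are copies, so the largest level sets the limit."""
    return max(l1_kb, l2_kb, l3_kb)


if __name__ == "__main__":
    sizes = (64, 512, 2048)  # assumed L1, L2, L3 sizes in KB
    print("exclusive:", exclusive_capacity_kb(*sizes), "KB")
    print("inclusive:", inclusive_capacity_kb(*sizes), "KB")
```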