Here we’ll look at why high-performance math libraries are a key tool for optimizing an HPC workload. Many HPC workloads in commercial and scientific computing rely on the classic Basic Linear Algebra Subprograms (BLAS). AMD’s implementation of BLAS, together with other math functions, is ACML (the AMD Core Math Library).
BLAS has its origins in Fortran, so the basic interface for ACML is a Fortran-based interface. That is why, when compiling HPL, we started with Make.Linux_ATHLON_FBLAS, whose FBLAS suffix indicates that we are using a Fortran BLAS interface.
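To make the Fortran heritage concrete, here is a minimal sketch of what a Fortran-style DGEMM call expects: matrices stored column by column in flat arrays, each with a leading dimension. The function `dgemm_colmajor` is a hypothetical pure-Python reference written for illustration. It covers only the no-transpose case and is not ACML code, which is hand-tuned assembly.

```python
def dgemm_colmajor(m, n, k, alpha, a, lda, b, ldb, beta, c, ldc):
    """Reference C := alpha*A*B + beta*C on flat column-major arrays.

    Follows the argument order of BLAS DGEMM without the TRANSA/TRANSB
    flags. Element (i, j) of a matrix with leading dimension ld is
    stored at index i + j*ld, as in Fortran.
    """
    for j in range(n):          # columns of C
        for i in range(m):      # rows of C
            acc = 0.0
            for p in range(k):  # inner (dot product) dimension
                acc += a[i + p * lda] * b[p + j * ldb]
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc]
    return c


# A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], stored column by column.
a = [1.0, 3.0, 2.0, 4.0]
b = [5.0, 7.0, 6.0, 8.0]
c = [0.0] * 4
dgemm_colmajor(2, 2, 2, 1.0, a, 2, b, 2, 0.0, c, 2)
print(c)  # [19.0, 43.0, 22.0, 50.0], i.e. [[19, 22], [43, 50]] column-major
```

A C program calling a Fortran BLAS has to follow the same storage convention, which is one reason HPL ships separate makefiles for the Fortran and C BLAS interfaces.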
To take full advantage of AMD Family 10h processors, AMD reworked ACML, the kernel of which consists of a great deal of hand-tuned assembly code. The reworking takes into account the different performance characteristics of AMD Family 10h processors compared to AMD Family 0Fh processors (the microarchitecture for the original AMD Opteron). These differences include L2 cache size, the new L3 cache, and a different number of TLB entries.
Reviewing Make.Linux_ATHLON_FBLAS, this line shows the default compiler options selected:
CCFLAGS = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops -W -Wall
This is a reasonable set of options for a baseline. For software in general, there are a few more options that could be added. Without changing any of these compiler options, and using GCC 4.1, we’ll compare ACML Version 3.6.1 with ACML Version 4.0.1 on a mixture of AMD Family 10h and Family 0Fh processors. We’ll change some of the parameters in hpccinf.txt to cover a broader range of matrix sizes. As we’ll see, significant performance gains come from the use of an ACML tuned for AMD Family 10h processors.
The following table shows the uplift going from ACML Version 3.6.1 to ACML Version 4.0.1 on a variety of AMD-based systems. The systems “Sahara” and “Palomar” are AMD Family 10h; the others are AMD Family 0Fh. Each value is the peak number in hpccoutf.txt for the ACML 4.0.1 run divided by the corresponding number from the ACML 3.6.1 run. The metrics shown are the ones you would expect to be affected by ACML BLAS routines, in particular DGEMM and High Performance Linpack.
ACML Version 4.0.1 uplift from ACML Version 3.6.1 in HPCC:
| Platform (all tests with OpenMPI 1.2.5) | Frequency in GHz | Num Cores | HPL Tflops (ACML 4.0.1 / 3.6.1) | StarDGEMM Gflops (ACML 4.0.1 / 3.6.1) | SingleDGEMM Gflops (ACML 4.0.1 / 3.6.1) |
|---|---|---|---|---|---|
| 2x Dual-Core Opteron 275 (Unifex) | 2.2 | 4 | 1.00 | 1.00 | 1.01 |
| 2x + 2x Dual-Core Opteron 275 (Unifex + Rogatien) | 2.2 | 8 | 1.00 | 1.00 | 1.01 |
| 4x Dual-Core Opteron 8216 (Kuhal) | 2.4 | 8 | 1.00 | 1.00 | 1.00 |
| 1x Quad-Core Phenom 9650 (Sahara) | 2.3 | 4 | 1.39 | 1.52 | 1.54 |
| 2x Quad-Core Opteron 2356 (Palomar) | 2.3 | 8 | 1.47 | 1.53 | 1.56 |
Several conclusions:
- ACML Version 4.0.1 offers significant benefits in improving performance on AMD Family 10h processors, including Third-Generation AMD Opteron and Phenom.
- AMD Family 10h uplift ranges from 39% to 56% for these tests.
- ACML Version 3.6.1 and ACML Version 4.0.1 are essentially equivalent on AMD Family 0Fh processors, with respect to these tests.
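The 39% to 56% range quoted above follows directly from the ratios in the table. As a small worked check, the sketch below converts the Family 10h ratios into percent uplift; the dictionary layout and the helper name `percent_uplift` are ours, and the numbers are copied from the table.

```python
# Ratios from the table: (HPL, StarDGEMM, SingleDGEMM), ACML 4.0.1 / 3.6.1.
family_10h_ratios = {
    "Sahara": (1.39, 1.52, 1.54),
    "Palomar": (1.47, 1.53, 1.56),
}


def percent_uplift(ratio):
    """Convert a new/old performance ratio into a whole-number percent gain."""
    return round((ratio - 1.0) * 100)


uplifts = [percent_uplift(r) for row in family_10h_ratios.values() for r in row]
print(min(uplifts), max(uplifts))  # 39 56
```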
Next, let’s take a closer look at some of the data in a way that gives more insight into the hardware differences and the ACML differences. The next two graphs show High Performance Linpack performance for (a) a baseline taken with ACML Version 3.6.1 and OpenMPI 1.2.5, and (b) ACML Version 4.0.1, also with OpenMPI 1.2.5.
Values of HPL “N” range from 5000 to 30000, stepping up by 5000. On the Phenom system, the HPL run only goes out to N=20000 because only 4 GB of RAM was available (this being a desktop platform).
The graphs should give you a sense of how performance rises for larger values of N.
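The N=20000 cutoff on the Phenom system can be checked with simple arithmetic: HPL factors a dense N x N matrix of 8-byte doubles, so the matrix alone needs N * N * 8 bytes. The sketch below assumes that “4 GB” means 4 GiB and ignores the memory taken by the operating system and HPL workspace, so the real limit is somewhat tighter than it suggests.

```python
def hpl_matrix_bytes(n):
    """Bytes needed for an n x n matrix of double-precision values."""
    return n * n * 8


RAM_BYTES = 4 * 2**30  # assumption: 4 GB of RAM interpreted as 4 GiB

for n in range(5000, 30001, 5000):
    size = hpl_matrix_bytes(n)
    verdict = "fits" if size < RAM_BYTES else "too large"
    print(n, round(size / 2**30, 2), "GiB", verdict)
```

With these assumptions N=20000 needs about 3 GiB and fits, while N=25000 needs more than 4.6 GiB and does not.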
ACML Version 3.6.1, OpenMPI 1.2.5
Sample Data for HPC Software Development Consideration

Figure 1: Raw HPL performance data using ACML Version 3.6.1; sample data only, not meant for competitive analysis
Several conclusions in looking at the ACML Version 3.6.1 data:
- Even without a library tuned for AMD Family 10h processors, the 2P Palomar system (AMD Family 10h, two 2.3 GHz processors, 8 cores) outperforms the 4P Kuhal system (AMD Family 0Fh, four 2.4 GHz processors, 8 cores).
- Similarly, the 1P Sahara system (AMD Family 10h, one 2.3 GHz processor, 4 cores) outperforms the 2P Unifex system (AMD Family 0Fh, two 2.2 GHz processors, 4 cores). This is an interesting uplift even if you discount the 100 MHz clock frequency difference.
- The 4P Kuhal outperforms the networked combination of the 2P Unifex and Rogatien systems. (This is interesting, but do not read too much into it; some HPC-type workloads may not scale up on a 4P system, while others may. The 2P systems tend to predominate in the HPC market.)
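One way to put these comparisons in context is to estimate each system’s theoretical double-precision peak as cores x clock x flops per cycle. The flops-per-cycle values below are not from the measurements in this article; they are an assumption based on the commonly cited figures of 2 per core for AMD Family 0Fh and 4 per core for AMD Family 10h. Treat the results as a rough upper bound, not as measured data.

```python
def peak_gflops(cores, ghz, flops_per_cycle):
    """Theoretical peak in Gflops: cores * clock (GHz) * flops per cycle."""
    return cores * ghz * flops_per_cycle


# (cores, clock in GHz, assumed double-precision flops per cycle per core)
systems = {
    "Palomar (Family 10h)": (8, 2.3, 4),
    "Kuhal (Family 0Fh)": (8, 2.4, 2),
    "Sahara (Family 10h)": (4, 2.3, 4),
    "Unifex (Family 0Fh)": (4, 2.2, 2),
}

for name, spec in systems.items():
    print(name, round(peak_gflops(*spec), 1), "Gflops peak")
```

Under that assumption, the Family 10h systems have roughly twice the theoretical peak of their Family 0Fh counterparts at a similar core count and clock, which is consistent with them winning even before the library is tuned.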
ACML Version 4.0.1, OpenMPI 1.2.5
Sample Data for HPC Software Development Consideration

Figure 2: Raw HPL performance data using ACML Version 4.0.1; sample data only, not meant for competitive analysis
Several conclusions in looking at the ACML Version 4.0.1 data:
- With a library tuned for AMD Family 10h processors, the 2P Palomar system (AMD Family 10h, two 2.3 GHz processors, 8 cores) significantly outperforms the 4P Kuhal system (AMD Family 0Fh, four 2.4 GHz processors, 8 cores), by about 60%.
- Not only that, but the 1P Sahara system (AMD Family 10h, one 2.3 GHz processor, 4 cores) approaches the performance of the networked combination of the 2P Unifex and 2P Rogatien systems (AMD Family 0Fh, two 2.2 GHz processors per system, 8 cores in total).
- ACML 4 is key to getting the best performance on AMD Family 10h.
Some further commentary: as noted, the 4P Kuhal outperforms the networked combination of the 2P Unifex and Rogatien systems. Do not conclude too much from this. This is a case of a workload that happens to work well on MP systems; what it really gives is a sense of MPI communication performance over shared memory versus over a network connection (here, a 1 Gbps network switch). It should not be a surprise that MPI-based interprocess communication is faster over shared memory than over a network.
There are a lot of tradeoffs here in terms of cost and performance. The general trend in HPC has been to go to distributed networked systems, since scalability is difficult on large MP systems. In large part this is due to the prevalence of workloads that are very memory intensive and are bottlenecked by memory bandwidth. Think of it this way: distributed systems give you lots of independent memory interfaces and memories, with relatively less interference from cache coherence traffic.
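To get a feel for the network side of that comparison, the sketch below computes a lower bound on the time to move a block of matrix data over a 1 Gbps link. It ignores latency and protocol overhead, so real transfers are slower. The message size is a hypothetical example (20000 rows by 128 columns of doubles), not a figure taken from these HPL runs.

```python
def min_transfer_seconds(num_bytes, link_bits_per_second):
    """Lower bound on transfer time: payload bits divided by link rate."""
    return num_bytes * 8 / link_bits_per_second


GIGABIT_PER_SECOND = 1e9  # the 1 Gbps switch mentioned above

# Hypothetical message: 20000 rows x 128 columns of 8-byte doubles.
message_bytes = 20000 * 128 * 8

seconds = min_transfer_seconds(message_bytes, GIGABIT_PER_SECOND)
print(message_bytes, "bytes take at least", seconds, "seconds")
```

A 1 Gbps link carries at most 125 MB per second, so about 20 MB of matrix data needs well over a tenth of a second on the wire, while processes on the same MP system exchange it through memory. The same function can be called with a different bandwidth to compare interconnects.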