
Introduction

HPL is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack Benchmark.

The algorithm used by HPL can be summarized by the following keywords: two-dimensional block-cyclic data distribution; right-looking variant of the LU factorization with row partial pivoting featuring multiple look-ahead depths; recursive panel factorization with pivot search and column broadcast combined; various virtual panel broadcast topologies; a bandwidth-reducing swap-broadcast algorithm; and backward substitution with look-ahead of depth 1.
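HPL reports its result in GFLOPS, derived from the operation count of the LU factorization and solve, roughly (2/3)*N^3 floating-point operations for a problem of order N, divided by the wall-clock time. A minimal sketch of that arithmetic; the problem size and the runtime t (in seconds) shown here are placeholders, not measured values:

# Approximate GFLOPS for problem size N solved in t seconds (lower-order terms ignored)
$ awk -v N=341040 -v t=3600 'BEGIN { printf "%.1f GFLOPS\n", (2/3)*N*N*N/(t*1e9) }'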

Official page for HPL: https://www.netlib.org/benchmark/hpl/

Build HPL using Spack

For adding external packages to Spack, refer to Build Customization (Adding external packages to Spack).

# Format for Building HPL
$ spack -d install -v  hpl@<Version Number> %aocc@<Version Number> +openmp target=<zen2/zen3> cflags="CFLAGS" ^amdblis@<Version Number> threads=openmp ^openmpi@<Version Number> fabrics="knem" ^knem%gcc@<GCC Version> target=<zen2/zen3>
# Example: For Building HPL-2.3 with AOCC-3.1.0 and AOCL 3.0
$ spack -d install -v hpl@2.3 %aocc@3.1.0 +openmp target=zen3 cflags="-O3" ^amdblis@3.0  threads=openmp ^openmpi@4.0.5 fabrics="knem" ^knem%gcc@8.3.1 target=zen
# Example: For Building HPL-2.3 with AOCC-3.0.0 and AOCL 3.0
$ spack -d install -v hpl@2.3 %aocc@3.0.0 +openmp target=zen3 cflags="-O3" ^amdblis@3.0  threads=openmp ^openmpi@4.0.3 fabrics="knem" ^knem%gcc@8.3.1 target=zen
# Example: For Building HPL-2.3 with AOCC-2.3.0 and AOCL 2.2
$ spack -d install -v hpl@2.3 %aocc@2.3.0 +openmp target=zen2 cflags="-O3" ^amdblis@2.2  threads=openmp ^openmpi@4.0.3 fabrics="knem" ^knem%gcc@8.3.1 target=zen
# Example: For Building HPL-2.3 with AOCC-2.2.0 and AOCL 2.2
$ spack -d install -v  hpl@2.3 %aocc@2.2.0 +openmp target=zen2 cflags="-O3" ^amdblis@2.2 threads=openmp ^openmpi@4.0.3 fabrics="knem" ^knem%gcc@9.2.0 target=zen2

Note: KNEM is a kernel module and must always be built with a GCC compiler. In the HPL build commands above, KNEM is compiled with GCC 8.3.1 or 9.2.0. Please change the GCC version in your Spack build command to match the preferred GCC available on your system; to check the available compilers, use the command "spack compilers". Note that the KNEM "target" option needs to be set according to the GCC version used: for GCC version 8 and lower, use "target=zen"; for GCC version 9 and higher, use "target=zen2".
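To see which compilers Spack already knows about, and to register a newly installed GCC or AOCC before building, the following commands can be used (a minimal sketch; the output will differ per system):

# List the compilers currently registered with Spack
$ spack compilers
# Search standard locations and $PATH for additional compilers and register them
$ spack compiler find
# Edit the registered compiler definitions if a path or flag needs changing
$ spack config edit compilers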

Please use any combination of the components/applications and versions listed below.

Component/Application    Versions Applicable
AOCC                     3.1.0, 3.0.0, 2.3.0, 2.2.0
AOCL                     3.0, 2.2
HPL                      2.3

Specifications and Dependencies

Symbol      Meaning
-d          Enable debug output
-v          Enable verbose output
@           Specify the version number
%           Specify the compiler
+openmp     Build with OpenMP support enabled
cflags      Pass CFLAGS to the Spack build from the command line
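These symbols combine into a single spec string. To preview how Spack will resolve a spec before building it, the spec can be passed to "spack spec"; for example, using the HPL-2.3 with AOCC-3.1.0 spec from the build section above:

# Show the fully concretized dependency tree without installing anything
$ spack spec hpl@2.3 %aocc@3.1.0 +openmp target=zen3 cflags="-O3" ^amdblis@3.0 threads=openmp ^openmpi@4.0.5 fabrics="knem" ^knem%gcc@8.3.1 target=zen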

Running HPL

To run HPL

  1. Create a run script (run_hpl_ccx.sh) that binds each MPI process to the AMD Core Complex Die (CCD) or Core Complex (CCX) associated with its local L3 cache. The script run_hpl_ccx.sh requires two additional files: appfile_ccx and xhpl_ccx.sh.
  2. Create these files in your working directory from the code snippets provided in the sections "run_hpl_ccx.sh", "appfile_ccx" and "xhpl_ccx.sh" below (see also the preparation sketch after this list).
  3. Create or update the HPL.dat file according to the underlying machine architecture.
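A minimal preparation sketch, assuming the file names used in this guide:

# Place run_hpl_ccx.sh, xhpl_ccx.sh, appfile_ccx and HPL.dat in the same working
# directory, then make the two shell scripts executable
$ chmod +x run_hpl_ccx.sh xhpl_ccx.sh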

Examples for running HPL on AMD 2nd Gen EPYC series and AMD 3rd Gen EPYC series processors are provided below.

Running HPL On AMD 2nd Gen EPYC Processors

run_hpl_ccx.sh
#! /bin/bash
# Setup Spack environment
source <spack_location>/spack/share/spack/setup-env.sh
# To load HPL into environment
spack load hpl %aocc@3.0.0
ldd `which xhpl`
which mpicc
sleep 10
# Run the appfile, which specifies 32 processes (one per CCX), each with its own CPU binding for its OpenMP threads
# set the CPU governor to performance

sudo cpupower frequency-set -g performance
# Verify the knem module is loaded
lsmod | grep -q knem
if [ $? -eq 1 ]; then
      echo "Loading knem module..."
      sudo modprobe -v knem
fi
mpi_options="--mca mpi_leave_pinned 1 --bind-to none --report-bindings --mca btl self,vader"
mpi_options="$mpi_options --map-by ppr:1:l3cache -x OMP_NUM_THREADS=4 -x OMP_PROC_BIND=TRUE -x OMP_PLACES=cores"
mpirun $mpi_options -app ./appfile_ccx

 

xhpl_ccx.sh
#! /bin/bash
#
# Bind memory to node $1 and four child threads to CPUs specified in $2
#
# Kernel parallelization is performed at the 2nd innermost loop (IC)

export LD_LIBRARY_PATH=$BLISROOT/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$OPENMPIROOT/lib:$LD_LIBRARY_PATH
export OMP_NUM_THREADS=$3
export GOMP_CPU_AFFINITY="$2"
export OMP_PROC_BIND=TRUE
# BLIS_JC_NT=1 (No outer loop parallelization):
export BLIS_JC_NT=1
# BLIS_IC_NT = number of cores per CCX (2nd-level threads, one per core in the shared L3 cache domain):
export BLIS_IC_NT=$OMP_NUM_THREADS
# BLIS_JR_NT=1 (No 4th level threads):
export BLIS_JR_NT=1
# BLIS_IR_NT=1 (No 5th level threads):
export BLIS_IR_NT=1
numactl --membind=$1 xhpl

Sample appfile_ccx file for AMD EPYC 7742 Processor model.

appfile_ccx
-np 1 ./xhpl_ccx.sh 0 0-3 4
-np 1 ./xhpl_ccx.sh 0 4-7 4
-np 1 ./xhpl_ccx.sh 0 8-11 4
-np 1 ./xhpl_ccx.sh 0 12-15 4
-np 1 ./xhpl_ccx.sh 1 16-19 4
-np 1 ./xhpl_ccx.sh 1 20-23 4
-np 1 ./xhpl_ccx.sh 1 24-27 4
-np 1 ./xhpl_ccx.sh 1 28-31 4
-np 1 ./xhpl_ccx.sh 2 32-35 4
-np 1 ./xhpl_ccx.sh 2 36-39 4
-np 1 ./xhpl_ccx.sh 2 40-43 4
-np 1 ./xhpl_ccx.sh 2 44-47 4
-np 1 ./xhpl_ccx.sh 3 48-51 4
-np 1 ./xhpl_ccx.sh 3 52-55 4
-np 1 ./xhpl_ccx.sh 3 56-59 4
-np 1 ./xhpl_ccx.sh 3 60-63 4
-np 1 ./xhpl_ccx.sh 4 64-67 4
-np 1 ./xhpl_ccx.sh 4 68-71 4
-np 1 ./xhpl_ccx.sh 4 72-75 4
-np 1 ./xhpl_ccx.sh 4 76-79 4
-np 1 ./xhpl_ccx.sh 5 80-83 4
-np 1 ./xhpl_ccx.sh 5 84-87 4
-np 1 ./xhpl_ccx.sh 5 88-91 4
-np 1 ./xhpl_ccx.sh 5 92-95 4
-np 1 ./xhpl_ccx.sh 6 96-99 4
-np 1 ./xhpl_ccx.sh 6 100-103 4
-np 1 ./xhpl_ccx.sh 6 104-107 4
-np 1 ./xhpl_ccx.sh 6 108-111 4
-np 1 ./xhpl_ccx.sh 7 112-115 4
-np 1 ./xhpl_ccx.sh 7 116-119 4
-np 1 ./xhpl_ccx.sh 7 120-123 4
-np 1 ./xhpl_ccx.sh 7 124-127 4
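Writing the appfile by hand is error prone on large core counts. The snippet below is a minimal sketch (not part of the official scripts) that regenerates the 7742 layout above with one rank per CCX; the three topology variables are assumptions to adjust, for example to 8 cores per CCX/CCD for the 3rd Gen EPYC layout shown later.

#!/bin/bash
# Hypothetical helper: generate appfile_ccx with one MPI rank per CCX.
# Adjust the three values below to match the topology reported by "lscpu" and "numactl -H".
CORES_PER_CCX=4      # 4 on 2nd Gen EPYC (Rome), 8 on 3rd Gen EPYC (Milan)
CORES_PER_NODE=16    # physical cores per NUMA node
NUM_NODES=8          # NUMA nodes across both sockets

> appfile_ccx
for (( node=0; node<NUM_NODES; node++ )); do
    for (( ccx=0; ccx<CORES_PER_NODE/CORES_PER_CCX; ccx++ )); do
        first=$(( node*CORES_PER_NODE + ccx*CORES_PER_CCX ))
        last=$(( first + CORES_PER_CCX - 1 ))
        echo "-np 1 ./xhpl_ccx.sh $node $first-$last $CORES_PER_CCX" >> appfile_ccx
    done
done

Verify the generated CPU ranges against the output of lscpu and numactl -H before use.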

HPL.dat

Please change the following values according to your system configuration for AMD EPYC 7002 Series Processors.

Ns is the size of your problem, and usually the goal is to find the largest problem size that fits in your system's memory.

Ps x Qs (P*Q) is the size of your process grid, which must equal the number of MPI processes launched by the run script.
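A common rule of thumb is to choose Ns so that the N x N double-precision matrix occupies roughly 80-90% of total memory, rounded down to a multiple of the block size NB. A minimal sketch of that arithmetic; the 0.85 fraction and NB=240 are assumptions to adjust:

# Estimate Ns: N*N*8 bytes should be ~85% of total memory, and Ns a multiple of NB
MEM_BYTES=$(awk '/MemTotal/ {print $2*1024}' /proc/meminfo)
NB=240
awk -v mem="$MEM_BYTES" -v nb="$NB" 'BEGIN { n = sqrt(0.85*mem/8); print int(n/nb)*nb }'

Ps x Qs must match the number of entries in appfile_ccx (32 in the 7742 example above, hence Ps=4 and Qs=8 in the sample below).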

Sample HPL.dat for two-socket AMD EPYC 7742 Processor with 1024GB (1TB) of memory.

HPL.dat
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out     output file name (if any)
6           device out (6=stdout,7=stderr,file)
1           # of problems sizes (N)
341040       Ns
1           # of NBs
240         NBs
0           MAP process mapping (0=Row-,1=Column-major)
1           # of process grids (P x Q)
4           Ps
8           Qs
16.0        threshold
1           # of panel fact
2           PFACTs (0=left, 1=Crout, 2=Right)
1           # of recursive stopping criterium
4           NBMINs (>= 1)
1           # of panels in recursion
2           NDIVs
1           # of recursive panel fact.
1           RFACTs (0=left, 1=Crout, 2=Right)
1           # of broadcast
1           BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1           # of lookahead depth
1           DEPTHs (>=0)
2           SWAP (0=bin-exch,1=long,2=mix)
64          swapping threshold
0           L1 in (0=transposed,1=no-transposed) form
0           U in (0=transposed,1=no-transposed) form
1           Equilibration (0=no,1=yes)
8           memory alignment in double (> 0)

To run HPL, please execute: $ ./run_hpl_ccx.sh
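Before launching, it can help to confirm that Ps x Qs in HPL.dat matches the number of ranks defined in the appfile; a quick check, assuming the file names used above:

# Count the MPI ranks listed in the appfile; should equal Ps x Qs (4 x 8 = 32 here)
$ grep -c '^-np' appfile_ccx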

Running HPL On AMD 3rd Gen EPYC Processors

run_hpl_ccx.sh
#!/usr/bin/env bash
# Setup Spack environment
source <spack_location>/spack/share/spack/setup-env.sh
# Optionally load OpenMPI explicitly; to load a specific installation, use its hash,
# e.g. "spack load openmpi /udyvzcb"
# spack load openmpi
# To load HPL into environment
spack load hpl %aocc@3.1.0
ldd `which xhpl`
which mpicc
sleep 10

# Run the appfile as root, which specifies 16 processes, each with its own CPU binding for OpenMP
# set the CPU governor to performance

sudo cpupower frequency-set -g performance
# Verify the knem module is loaded

lsmod | grep -q knem
if [ $? -eq 1 ]; then
echo "Loading knem module..."
sudo modprobe -v knem
fi
mpi_options="--allow-run-as-root --mca mpi_leave_pinned 1 --bind-to none --report-bindings --mca btl self,vader --map-by ppr:1:l3cache -x OMP_NUM_THREADS=8 -x OMP_PROC_BIND=TRUE -x OMP_PLACES=cores"
mpirun $mpi_options -app ./appfile_ccx

 

xhpl_ccx.sh
#! /usr/bin/env bash

#
# Bind memory to node $1 and the child threads (count given by $3) to the CPUs specified in $2
#
# Kernel parallelization is performed at the 2nd innermost loop (IC)
export LD_LIBRARY_PATH=$BLISROOT/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$OPENMPIROOT/lib:$LD_LIBRARY_PATH
export OMP_NUM_THREADS=$3
export GOMP_CPU_AFFINITY="$2"
export OMP_PROC_BIND=TRUE
# BLIS_JC_NT=1 (No outer loop parallelization):
export BLIS_JC_NT=1
# BLIS_IC_NT = number of cores per CCX (2nd-level threads, one per core in the shared L3 cache domain):
export BLIS_IC_NT=$OMP_NUM_THREADS
# BLIS_JR_NT=1 (No 4th level threads):
export BLIS_JR_NT=1
# BLIS_IR_NT=1 (No 5th level threads):
export BLIS_IR_NT=1
numactl --membind=$1 xhpl

Sample appfile_ccx file for AMD EPYC 7763 Processor

appfile_ccx
-np 1 ./xhpl_ccx.sh 0 0-7 8
-np 1 ./xhpl_ccx.sh 0 8-15 8
-np 1 ./xhpl_ccx.sh 1 16-23 8
-np 1 ./xhpl_ccx.sh 1 24-31 8
-np 1 ./xhpl_ccx.sh 2 32-39 8
-np 1 ./xhpl_ccx.sh 2 40-47 8
-np 1 ./xhpl_ccx.sh 3 48-55 8
-np 1 ./xhpl_ccx.sh 3 56-63 8
-np 1 ./xhpl_ccx.sh 4 64-71 8
-np 1 ./xhpl_ccx.sh 4 72-79 8
-np 1 ./xhpl_ccx.sh 5 80-87 8
-np 1 ./xhpl_ccx.sh 5 88-95 8
-np 1 ./xhpl_ccx.sh 6 96-103 8
-np 1 ./xhpl_ccx.sh 6 104-111 8
-np 1 ./xhpl_ccx.sh 7 112-119 8
-np 1 ./xhpl_ccx.sh 7 120-127 8

HPL.dat

Please change the following values according to your system configuration for AMD EPYC 7003 Series Processors.

Ns is the size of your problem, and usually the goal is to find the largest problem size that fits in your system's memory.

Ps x Qs (P*Q) is the size of your process grid, which must equal the number of MPI processes launched by the run script (16 entries in the appfile_ccx above, hence Ps=2 and Qs=8).

Sample HPL.dat for two-socket AMD EPYC 7763 Processor with 1024GB (1TB) of memory.

HPL.dat
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out     output file name (if any)
6           device out (6=stdout,7=stderr,file)
1           # of problems sizes (N)
362493      Ns
1           # of NBs
224         NBs
0           MAP process mapping (0=Row-,1=Column-major)
1           # of process grids (P x Q)
2           Ps
8           Qs
16.0        threshold
1           # of panel fact
2           PFACTs (0=left, 1=Crout, 2=Right)
1           # of recursive stopping criterium
4           NBMINs (>= 1)
1           # of panels in recursion
2           NDIVs
1           # of recursive panel fact.
1           RFACTs (0=left, 1=Crout, 2=Right)
1           # of broadcast
1           BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1           # of lookahead depth
1           DEPTHs (>=0)
2           SWAP (0=bin-exch,1=long,2=mix)
64          swapping threshold
0           L1 in (0=transposed,1=no-transposed) form
0           U in (0=transposed,1=no-transposed) form
1           Equilibration (0=no,1=yes)
8           memory alignment in double (> 0)

To run HPL, please execute: $ ./run_hpl_ccx.sh