The STREAM benchmark is a simple, synthetic benchmark program that measures sustainable main memory bandwidth in MB/s and the corresponding computation rate for simple vector kernels.

The general rule for running STREAM is that each array must be at least 4x the size of the sum of all the last-level caches used in the run, or 1 million elements, whichever is larger.
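The sizing rule above can be turned into a quick calculation. The sketch below is illustrative only: the helper name `min_stream_elements` and the 64 MiB cache total are assumptions chosen for the example, not values from this guide. It takes the combined last-level cache size in bytes and the size of one element in bytes (8 for double), and prints the smallest array length that satisfies the rule.

```shell
# Hypothetical helper: minimum STREAM array length, in elements.
# $1 = sum of all last-level caches used in the run, in bytes
# $2 = size of one array element in bytes (8 for double)
min_stream_elements() {
    # Each array must span at least 4x the combined cache size (rounded up).
    elements=$(( (4 * $1 + $2 - 1) / $2 ))
    # The rule also sets a floor of 1 million elements.
    if [ "$elements" -lt 1000000 ]; then
        elements=1000000
    fi
    echo "$elements"
}

# Assumed example: 64 MiB of last-level cache in total, double-precision elements.
min_stream_elements $(( 64 * 1024 * 1024 )) 8
```

On Linux, the cache sizes needed as input can typically be read from `lscpu` or from the `size` files under /sys/devices/system/cpu/cpu*/cache/.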

STREAM uses four kernels for analysis:

  1. "Copy" measures transfer rates in the absence of arithmetic.
  2. "Scale" adds a simple arithmetic operation.
  3. "Sum" adds a third operand to allow multiple load/store ports on vector machines to be tested.
  4. "Triad" allows chained/overlapped/fused multiply/add operations.

Official website for STREAM:

Build STREAM using Spack

For adding external packages to Spack, see: Build Customization (Adding external packages to Spack)

# Format for Building STREAM
$ spack -d install -v stream@<Version> +openmp %aocc@<Version> cflags="CFLAGS"
# Example: Building STREAM with AOCC 3.2.0
$ spack -d install -v stream@5.10 +openmp %aocc@3.2.0 cflags="-mcmodel=large -DSTREAM_TYPE=double -mavx2 -DSTREAM_ARRAY_SIZE=260000000 -DNTIMES=10 -ffp-contract=fast -fnt-store"
# Example: Building STREAM with AOCC 3.1.0
$ spack -d install -v stream@5.10 +openmp %aocc@3.1.0 cflags="-mcmodel=large -DSTREAM_TYPE=double -mavx2 -DSTREAM_ARRAY_SIZE=260000000 -DNTIMES=10 -ffp-contract=fast -fnt-store"
# Example: Building STREAM with AOCC 3.0.0
$ spack -d install -v stream@5.10 %aocc@3.0.0 +openmp cflags="-mcmodel=large -DSTREAM_TYPE=double -mavx2 -DSTREAM_ARRAY_SIZE=260000000 -DNTIMES=10 -ffp-contract=fast -fnt-store"
# Example: Building STREAM with AOCC 2.3.0
$ spack -d install -v stream@5.10 %aocc@2.3.0 +openmp cflags="-mcmodel=large -DSTREAM_TYPE=double -mavx2 -DSTREAM_ARRAY_SIZE=2600000000 -DNTIMES=10 -ffp-contract=fast -fnt-store"
# Example: Building STREAM with AOCC 2.2.0
$ spack -d install -v stream@5.10 %aocc@2.2.0 +openmp cflags="-mcmodel=large -DSTREAM_TYPE=double -mavx2 -DSTREAM_ARRAY_SIZE=2600000000 -DNTIMES=10 -ffp-contract=fast -fnt-store"

Compatibility of STREAM versions with AOCC versions is given below:

Component/Application    Versions Applicable
AOCC                     3.2.0, 3.1.0, 3.0.0, 2.3.0, 2.2.0

Specifications and Dependencies

Symbol     Meaning
-d         Enable debug output
-v         Enable verbose output
@          Specify the version number
%          Specify the compiler
+openmp    Build with OpenMP enabled
cflags     Pass C compiler flags to the Spack build from the command line

Basic details of the flags used:

  • -mcmodel=large: Generates code for the large model. This model makes no assumptions about addresses and sizes of sections.
  • -DSTREAM_ARRAY_SIZE=260000000: Sets the array size, in elements, for the STREAM benchmark. The general recommendation is that each array be at least 4x the size of the sum of all the last-level caches in the system.
  • -DNTIMES=10: STREAM runs each kernel NTIMES times.
  • -ffp-contract=fast: Enables floating-point expression contraction, such as forming fused multiply-add operations, if the target has native support for them.
  • -fnt-store: Generates non-temporal store instructions for array accesses in a loop with a large trip count.
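To see why a large code model accompanies a large array size, it helps to estimate the static memory footprint of the benchmark. The sketch below assumes STREAM's three working arrays and 8-byte double elements; the helper name `stream_footprint_mib` is made up for illustration.

```shell
# Hypothetical helper: approximate static memory used by STREAM, in MiB.
# $1 = STREAM_ARRAY_SIZE (elements), $2 = bytes per element (8 for double)
stream_footprint_mib() {
    # STREAM declares three arrays (a, b, c) of the same length.
    echo $(( 3 * $1 * $2 / 1048576 ))
}

# Array size used in the AOCC 3.x build examples.
stream_footprint_mib 260000000 8
```

For 260000000 doubles this comes to roughly 5950 MiB, well above the 2 GiB of code and static data that the default small code model on x86-64 can address.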

Running STREAM

These are the recommended settings for running STREAM on AMD processors:

  • STREAM generally gives the best performance with 1 thread per CCD.
  • Example binding options for AMD EPYC 7742 and AMD EPYC 7763 processors to bind 1 thread per CCD: "export GOMP_CPU_AFFINITY=0-127:8" and "export OMP_NUM_THREADS=16"
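The binding values above follow from the core count and the CCD width, so they can be derived for other topologies. This is a minimal sketch, assuming cores are numbered contiguously within each CCD; the helper name `one_thread_per_ccd` is hypothetical, and the 128-core, 8-cores-per-CCD input is an assumption consistent with the 0-127:8 setting (for example, a two-socket node of 64-core parts).

```shell
# Hypothetical helper: print OpenMP settings that place one thread on the
# first core of every CCD.
# $1 = total number of cores in the node, $2 = cores per CCD
one_thread_per_ccd() {
    # Range "0-(N-1)" with a stride equal to the CCD width picks one core per CCD.
    echo "export GOMP_CPU_AFFINITY=0-$(( $1 - 1 )):$2"
    # One thread for each CCD.
    echo "export OMP_NUM_THREADS=$(( $1 / $2 ))"
}

# Assumed topology: 128 cores in total, 8 cores per CCD.
one_thread_per_ccd 128 8
```

The printed lines can be reviewed and then pasted into the shell, or applied directly with `eval "$(one_thread_per_ccd 128 8)"`.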

Setting Environment
# Format for loading STREAM built with AOCC
$ spack load stream@<Version> %aocc@<Version>
# Example: Load STREAM built with AOCC 3.2.0 into the environment
$ spack load stream %aocc@3.2.0

Note: Rebooting the node before the run is recommended for optimal STREAM results.

Run Command
# Running STREAM:
$ spack load stream@5.10%aocc@3.2.0
# The writes to /sys and /proc below require root privileges
$ echo madvise | tee /sys/kernel/mm/transparent_hugepage/enabled
$ echo madvise | tee /sys/kernel/mm/transparent_hugepage/defrag
$ echo 3 > /proc/sys/vm/drop_caches
$ echo 1 > /proc/sys/kernel/numa_balancing
$ export OMP_SCHEDULE=static
$ export OMP_DYNAMIC=false
$ export OMP_THREAD_LIMIT=256
$ export OMP_STACKSIZE=256M
# Thread binding options for AMD EPYC 7742/7763 processors (1 thread per CCD)
$ export GOMP_CPU_AFFINITY=0-127:8
$ export OMP_NUM_THREADS=16
$ echo "running for 1 thread per CCD"
$ stream_c.exe
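
When comparing runs, it can be convenient to pull a single bandwidth figure out of the STREAM report. The helper below is a sketch rather than part of STREAM or Spack: the name `stream_best_rate` is hypothetical, and it assumes the standard report layout in which each kernel line starts with a label such as "Triad:" followed by the best rate in MB/s. STREAM labels the "Sum" kernel as "Add" in its output.

```shell
# Hypothetical helper: read STREAM output on stdin and print the
# "Best Rate MB/s" column for one kernel.
# $1 = kernel label as printed by STREAM (Copy, Scale, Add or Triad)
stream_best_rate() {
    # Match the line whose first field is "<label>:" and print the second field.
    awk -v kernel="$1:" '$1 == kernel { print $2 }'
}

# Usage sketch (requires the stream_c.exe binary loaded as described above):
#   stream_c.exe | stream_best_rate Triad
```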