The STREAM benchmark is a simple, synthetic benchmark program that measures sustainable main memory bandwidth in MB/s and the corresponding computation rate for simple vector kernels.
The general rule for running STREAM is that each array must be at least four times the combined size of all the last-level caches used in the run, or 1 million elements, whichever is larger.
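The sizing rule above can be worked through as a small calculation. This is a minimal sketch, not part of STREAM itself: the function name is hypothetical, the 8-byte element size assumes double-precision arrays, and the cache figure in the example (two sockets with 256 MiB of L3 each) is an assumed configuration that should be replaced with the values of the actual system.

```python
def min_stream_array_elements(total_llc_bytes, element_bytes=8):
    """Smallest per-array element count that satisfies the sizing rule."""
    # Four times the combined last-level cache, expressed in elements
    # (rounded up so the array is never smaller than 4x the cache).
    cache_rule = -(-4 * total_llc_bytes // element_bytes)
    # The rule also sets a floor of 1 million elements.
    return max(cache_rule, 1_000_000)


if __name__ == "__main__":
    # Assumed example: two sockets with 256 MiB of L3 cache each.
    llc_bytes = 2 * 256 * 1024 * 1024
    print(min_stream_array_elements(llc_bytes))  # 268435456
```

Any array size at or above the returned value meets the rule; larger sizes only lengthen the run.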
STREAM uses four kernels for analysis:
- “Copy” measures transfer rates in the absence of arithmetic.
- “Scale” adds a simple arithmetic operation.
- “Sum” adds a third operand to allow multiple load/store ports on vector machines to be tested.
- “Triad” allows chained/overlapped/fused multiply/add operations.
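The four kernels listed above can be illustrated with plain loops. The sketch below is written in Python only to show what each kernel computes; the benchmark itself is a C program, and the function names here are hypothetical. The bandwidth helper assumes 8-byte elements and counts one array transfer per load or store, which gives two arrays for Copy and Scale and three for Sum and Triad.

```python
def stream_kernels(a, b, c, scalar):
    """Run the four STREAM-style kernels once, in place, on equal-length lists."""
    n = len(a)
    for j in range(n):          # Copy: one load, one store, no arithmetic
        c[j] = a[j]
    for j in range(n):          # Scale: adds a multiply
        b[j] = scalar * c[j]
    for j in range(n):          # Sum: two loads, one store
        c[j] = a[j] + b[j]
    for j in range(n):          # Triad: multiply and add combined
        a[j] = b[j] + scalar * c[j]


# Arrays moved per element by each kernel (loads plus stores),
# which is what a bandwidth figure is derived from.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Sum": 3, "Triad": 3}


def bandwidth_mb_per_s(kernel, n_elements, seconds, element_bytes=8):
    """Bandwidth in MB/s (10**6 bytes per second) for one kernel pass."""
    moved = ARRAYS_TOUCHED[kernel] * element_bytes * n_elements
    return 1.0e-6 * moved / seconds
```

For example, starting from `a = [1.0] * 4`, `b = [2.0] * 4`, `c = [0.0] * 4` and `scalar = 3.0`, one call leaves `c` holding 4.0, `b` holding 3.0, and `a` holding 15.0 in every position.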
Official website for STREAM: https://www.cs.virginia.edu/stream/
Build STREAM using Spack
To add external packages to Spack, see Build Customization (Adding external packages to Spack).
The compatibility of STREAM versions with AOCC versions is given below:
|AOCC||3.0.0, 2.3.0, 2.2.0|
Specifications and Dependencies
|-d||To enable debug output|
|-v||To enable verbose output|
|@||To specify the version number|
|%||To specify the compiler|
|+openmp||To build with OpenMP enabled|
|cflags||To pass C compiler flags to the Spack build from the command line|
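To show how the pieces in the table fit together, the sketch below assembles an install command as a string. It is an illustration under assumptions, not a command taken from this guide: the helper name is hypothetical, the STREAM version `5.10` and the flag values are assumed for the example, and the placement of `-d` before `install` and `-v` after it follows the usual Spack command layout.

```python
def spack_install_command(package, version=None, compiler=None,
                          openmp=False, cflags=None,
                          debug=False, verbose=False):
    """Assemble a 'spack install' command line from the spec pieces above."""
    parts = ["spack"]
    if debug:
        parts.append("-d")            # debug output (global spack option)
    parts.append("install")
    if verbose:
        parts.append("-v")            # verbose build output
    spec = package
    if version:
        spec += "@" + version         # '@' selects the package version
    parts.append(spec)
    if compiler:
        parts.append("%" + compiler)  # '%' selects the compiler
    if openmp:
        parts.append("+openmp")       # '+' turns the variant on
    if cflags:
        parts.append('cflags="' + cflags + '"')
    return " ".join(parts)


if __name__ == "__main__":
    # Version numbers and flags here are illustrative assumptions.
    print(spack_install_command(
        "stream", version="5.10", compiler="aocc@3.0.0", openmp=True,
        cflags="-O3 -mcmodel=large -DSTREAM_ARRAY_SIZE=2500000000",
        debug=True, verbose=True))
```

Under these assumptions the script prints `spack -d install -v stream@5.10 %aocc@3.0.0 +openmp cflags="-O3 -mcmodel=large -DSTREAM_ARRAY_SIZE=2500000000"`, which can be checked against the versions available in the local Spack installation before use.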
Basic details of the flags used:
- -mcmodel=large: Generates code for the large code model. This model makes no assumptions about the addresses and sizes of sections.
- STREAM_ARRAY_SIZE=2500000000: Sets the array size (the number of elements in each array) for the STREAM benchmark. The general recommendation is that each array must be at least 4x the size of the sum of all the last-level caches in the system.
- NTIMES: STREAM runs each kernel NTIMES times and reports the best result of any iteration after the first.
- -ffp-contract=fast: Enables floating-point expression contraction, such as forming fused multiply-add operations, if the target has native support for them.
- -fnt-store: Generates non-temporal store instructions for array accesses in loops with a large trip count.
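A short calculation shows why the large code model matters for the array size listed above. The sketch assumes the benchmark keeps three statically allocated arrays of 8-byte elements and that the default small code model expects code and data to fit within 2 GiB; the function names are hypothetical.

```python
def stream_static_bytes(array_size, element_bytes=8, n_arrays=3):
    """Total bytes held by the benchmark's statically allocated arrays."""
    return array_size * element_bytes * n_arrays


def needs_larger_code_model(array_size, element_bytes=8):
    """True when the arrays exceed the 2 GiB reach assumed for the small model."""
    small_model_limit = 2 * 1024**3
    return stream_static_bytes(array_size, element_bytes) > small_model_limit


if __name__ == "__main__":
    size = 2_500_000_000
    print(stream_static_bytes(size) / 1e9)   # 60.0 (GB across three arrays)
    print(needs_larger_code_model(size))     # True
```

With 2,500,000,000 elements per array, the three arrays need about 60 GB of memory, so the system must have at least that much free RAM in addition to a code model that can address it.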
The following steps are recommended for running STREAM on AMD processors:
- STREAM generally gives better performance with 1 thread per core complex die (CCD).
- Example binding options for AMD EPYC 7742 and AMD EPYC 7763 processors to bind 1 thread per CCD: "export GOMP_CPU_AFFINITY=0-127:8" and "export OMP_NUM_THREADS=16"
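To check that a binding string matches the thread count, the affinity value can be expanded into the CPU numbers it selects. The helper below is a hypothetical sketch that handles only the simple forms used here (single CPUs, `first-last` ranges, and an optional `:stride`), separated by spaces or commas; it is not a full parser for the environment variable.

```python
def expand_affinity(spec):
    """Expand a GOMP_CPU_AFFINITY-style string into a list of CPU numbers."""
    cpus = []
    for item in spec.replace(",", " ").split():
        cpu_range, _, stride = item.partition(":")
        first, _, last = cpu_range.partition("-")
        step = int(stride) if stride else 1
        end = int(last) if last else int(first)
        cpus.extend(range(int(first), end + 1, step))
    return cpus


if __name__ == "__main__":
    cpus = expand_affinity("0-127:8")
    print(len(cpus))   # 16
    print(cpus[:4])    # [0, 8, 16, 24]
```

The value `0-127:8` selects every eighth CPU from 0 to 120, which gives 16 CPUs and therefore agrees with `OMP_NUM_THREADS=16`. On a system with a different core count or CCD layout, both values would need to be adjusted together.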