AMD Phenom II X4 965 CPU Results
As discussed earlier, a very well optimized implementation of this code for the AMD Phenom II X4 965 CPU might achieve 10.5 SP GFLOP/s on this computation. The upper bound on performance is 10.6 SP GFLOP/s: the processor has 21.3 GB/s of memory bandwidth, and the bound assumes perfect caching of the vector and offset array, so that only the matrix data has to be streamed from memory.
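The arithmetic behind this bound can be sketched as follows. This is a minimal illustration, not the authors' own calculation: it assumes each matrix element costs one multiply-add (2 floating-point operations) and one 4-byte single-precision read, and that no other memory traffic occurs. The function names and constants are hypothetical, introduced only for this example.

```python
# Bandwidth-limited performance bound (illustrative sketch).
# Assumptions, not stated in this section:
#   - each matrix element contributes one multiply-add = 2 floating-point ops
#   - each matrix element is one 4-byte single-precision value
#   - the vector and offset array are perfectly cached (no extra traffic)

FLOPS_PER_ELEMENT = 2  # one multiply plus one add
BYTES_PER_ELEMENT = 4  # one single-precision float


def bandwidth_bound_gflops(bandwidth_gb_s):
    """Upper bound in GFLOP/s when matrix reads are the only memory traffic."""
    return bandwidth_gb_s * FLOPS_PER_ELEMENT / BYTES_PER_ELEMENT


def fraction_of_bound(measured_gflops, bound_gflops):
    """Share of the bandwidth bound that a measured result reaches."""
    return measured_gflops / bound_gflops


# 21.3 GB/s is the memory bandwidth quoted for the Phenom II X4 965.
bound = bandwidth_bound_gflops(21.3)
print(f"bound: {bound:.2f} SP GFLOP/s")

# 2.9 SP GFLOP/s is the measured result reported in this section.
print(f"achieved: {fraction_of_bound(2.9, bound):.0%} of bound")
```

Under these assumptions the bound works out to about 10.65 SP GFLOP/s, which the text quotes as 10.6, and a measured 2.9 SP GFLOP/s corresponds to roughly 27% of it.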
We applied all of our optimizations except the use of OpenCL™ images to cache the vector, because the ATI Stream SDK v2.1 does not currently support images on x86 CPUs. With these optimizations we reach 2.9 SP GFLOP/s, as shown in Figure 12. Although this is only 27% of the bound, OpenCL™ still enabled us to run the same code and utilize all of our cores.
Figure 12: Vectorized AMD Phenom II X4 965 CPU Results
OpenCL™ and the OpenCL™ logo are trademarks of Apple Inc. used by permission by Khronos.