I am often asked how to compare the performance of a software implementation that uses OpenCL™ to that of a “native CPU” implementation that uses only pure C++, possibly with Integrated Performance Primitives (IPP). This comparison is straightforward in OpenCV 3.0, a popular computer vision library that AMD has supported since 2011, and I will explain how to do it shortly.
Transparent Acceleration via OpenCL
First, let me give you a brief introduction to OpenCV 3.0. The library now supports transparent acceleration via OpenCL. At runtime, if OpenCL is available and has not been disabled, it is used by default (and preferentially) whenever an algorithm has an OpenCL implementation. Just as in 2.4, plenty of algorithms have OpenCL implementations, especially in the imgproc module!
Enabling or disabling OpenCL is controlled globally via an environment variable, which you should set or clear before you run the performance tests or the samples. In this blog I am using 64-bit Windows 7; the modifications for other platforms should be fairly obvious to those skilled in the art!
To disable OpenCL (to enable pure native runs) do:
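The command itself was dropped from this post. To the best of my knowledge of OpenCV 3.0’s core/ocl module, the controlling variable is OPENCV_OPENCL_DEVICE, and setting it to the special value “disabled” turns OpenCL off:

```shell
:: Windows (cmd): after this, cv::ocl::useOpenCL() reports false
:: and all transparent-API calls take the native C++/IPP path.
set OPENCV_OPENCL_DEVICE=disabled
```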
To re-enable OpenCL (remember, it was enabled by default, before you disabled it with the line above), you need to clear the environment variable, for example:
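Again the command is missing above; clearing the same variable (assuming OPENCV_OPENCL_DEVICE, as in the previous step) restores the default behavior:

```shell
:: Windows (cmd): clearing the variable restores the default (OpenCL on)
set OPENCV_OPENCL_DEVICE=
```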
You can also specify a particular OpenCL device for the run, for example:
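As far as I recall, the selector string follows the pattern &lt;platform&gt;:&lt;CPU|GPU|ACCELERATOR&gt;:&lt;device name or index&gt;, with an empty field acting as a wildcard. For example, to pick the first GPU of any OpenCL platform:

```shell
:: Windows (cmd): empty platform field = any platform; GPU:0 = first GPU
set OPENCV_OPENCL_DEVICE=:GPU:0
```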
The Transparent API
Back to the transparent API: it lets you unify native and OpenCL-accelerated programming in a single code base. You write your code only once! Gone are the days of OpenCV 2.4, when you had to call different functions to enable an OpenCL run. In fact, the “ocl” namespace and folder are gone, and so is the “oclMat.” There is a new unified data structure, the UMat, that handles data transfers when needed.
All code under the transparent API (the “T-API”) must be numerically equivalent across implementations, and the accuracy tests enforce that. This makes perfect sense: you don’t want different results depending on which platform you run your code on.
Library developers who wish to enhance the library should implement new functionality in both OpenCL and C++, and should add accuracy and performance tests. Comparing the C++ and OpenCL results is a good sanity check anyway!
Library users can just declare their variables as UMat type, and reap the benefit of transparent acceleration that works on all platforms supporting OpenCL (including discrete and integrated GPUs).
With that understanding, let’s get back to performance testing.
First, get the code from the master branch at https://github.com/itseez/opencv. Configure cmake to generate projects for your platform. OpenCV now provides IPP binaries, and IPP is enabled by default. This is great, because you can compare OpenCL and IPP directly on various platforms, and draw your own conclusions! You can also configure cmake to use the multi-threading framework of your choice; Microsoft’s Concurrency Runtime is enabled by default on Windows.
If you are planning to compare data from many different platforms, it pays to be systematic and organized, in terms of naming conventions. I recommend that you name the output directory that cmake uses for the targets according to the options enabled in cmake. For example, a good naming convention for your “buildDir” is:
Replace the names in brackets with your configuration. For example, in the above [Compiler] might be VS2013, [cmake_options] could be ocl_ipp, or ocl_noipp, and so on. [Arch] might be “x86”, “x64” etc.
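For instance, a configuration step might look like the following. This is a hypothetical layout: the generator name and the source path are placeholders you should adapt, while WITH_OPENCL and WITH_IPP are real OpenCV cmake options:

```shell
:: Build directory named after the configuration: VS2013, OpenCL+IPP, x64
mkdir VS2013_ocl_ipp_x64
cd VS2013_ocl_ipp_x64
cmake -G "Visual Studio 12 2013 Win64" -D WITH_OPENCL=ON -D WITH_IPP=ON [path to opencv source]
```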
You should always leave WITH_OPENCL turned on (which is the default); otherwise you will not be able to compare the results of your tests!
After you generate the projects in cmake, you need to build them (to state the obvious). Then look at the binaries in [buildDir]\bin\Release. There is one performance test per module, named opencv_perf_[module name].exe.
We will use the image processing module, “imgproc,” as an example. The name of the performance test is, yes, you guessed it, opencv_perf_imgproc.exe!
Then, you need the test data. Get them from https://github.com/itseez/opencv_extra. You should use the “master” branch.
After you get the test data, you need to set an environment variable so that OpenCV performance tests will find the data directory:
set OPENCV_TEST_DATA_PATH=[path to opencv_extra-master]\testdata
Once again to state the obvious, use your actual path.
We can now run the test, and output the results to a file. Here again, for the purpose of comparison across runs, it pays to be systematic with the naming conventions. The naming convention I use is:
[ocv module]-[platform]-[cmake options]-[runtime options]-[arch]
For example, [ocv module] might be “imgproc”; [platform] might be “KV35W” for a 35W Kaveri; [cmake options] and [arch] will be as above; [runtime options] can be something like “oclgpu0”, “noocl”.
Finally, here is how to run the tests:
opencv_perf_[module name].exe --gtest_filter=*OCL* --gtest_output=xml:[output file as above].xml
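Putting the naming convention and the command together, a concrete invocation for the Kaveri run described below might look like this (the output file name is just my convention, not something the tool requires):

```shell
:: imgproc module, 35W Kaveri, OpenCL+IPP build, running on GPU 0, x64
opencv_perf_imgproc.exe --gtest_filter=*OCL* --gtest_output=xml:imgproc-KV35W-ocl_ipp-oclgpu0-x64.xml
```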
That’s it! Simply run this on the platform of your choice. The good thing about OpenCL, and OpenCV, is that the OpenCL runtime is loaded dynamically. So you can build your code once (or at least once per cmake configuration and OS) and distribute the binaries to the test platforms, as long as they run the same OS. OpenCL is loaded at runtime and your code just works, using whatever OpenCL driver is present in the system, independent of the IDE you used to compile the “host” code. Of course, you can also disable OpenCL, as explained above, and then you end up with performance data for a native run under the transparent API!
For illustration purposes in what follows, I am using OpenCV 3.0 code as of 9/16/2014, on a 35 Watt AMD Embedded R-Series APU, the RX-427BB. Some of you may know this as “Bald Eagle,” an embedded Kaveri APU with 8 OpenCL compute units. I ran the imgproc module twice on that platform: first enabling and then disabling OpenCL via the environment variable, as explained above.
Another great thing about OpenCV is that it comes with scripts you can use to compare different runs. They are located in [code_dir]\misc. Here is how to use my favorite, [code_dir]\misc\summary.py.
python.exe [code_dir]\misc\summary.py -o htm [out1].xml [out2].xml > [comp].html
You can supply as many result files (of the same module) as you would like; summary.py will align the data and give you a nice web page with comparisons.
But wait, there’s more! While the web page is very useful, you can also load the data into Excel and do your own statistics and plots. The summary.py script can output in csv format, but I have found it easier (i.e., less manual editing) to output in html and then load the html into Excel. As the figures below show, in Excel you choose Get External Data “From Web”, and in the resulting dialog you simply select the [comp].html file from above.
After importing, I usually clean up the file a bit, although you may be OK with it “as is.” Personally, I like to eliminate the comparison columns (which I can regenerate within Excel) and remove some of the top rows. I also replace the “ms” in the performance columns with an empty string. Finally, I split the name of the test into three columns, as follows: first, add two more columns next to “Name of test”, and call them “subtest” and “config”; then go to Data->“Text to Columns”->Delimited->Other (select “:” and treat consecutive delimiters as one). The figure below shows the result.
To calculate the OpenCL advantage column (column F in the figure), I divide the time it takes to execute a test natively (e.g., C++ with or without IPP) by the time it takes to execute it in OpenCL. If this ratio is greater than one, that’s good news for OpenCL! For example, 10 ms natively versus 4 ms in OpenCL is a 2.5x advantage.
A Chart with the Advantage of OpenCL vs Other Native Runs
Last but not least, we can conveniently summarize the results using Excel’s PivotTable, or PivotChart. You can configure it as shown in the figure below.
You get a very nice chart with the advantage of OpenCL vs other native runs, per test. The “grand total” is a cumulative average, across all tests. Even if you aren’t familiar with pivot tables, I’m sure you can derive the information you need from the spreadsheet data, as you prefer.
Obviously, not all algorithms in OpenCV are implemented equally well, and there is some obvious low-hanging fruit. For example, it wouldn’t violate any laws of physics if “integral” were faster in GPU/OpenCL than on the CPU (in fact, it should be!), so there is some work left to be done. This is open source code, and we invite the community’s help!
Straightforward Performance Testing with OpenCV 3.0
Overall, the real purpose of this article is to show you how straightforward performance testing is with OpenCV 3.0, and to invite you to follow the steps above. Compare the performance of OpenCL runs against native runs of your choice, on platforms of various capabilities, and decide for yourself. You will likely want to compare “best runs” on comparable platforms (e.g. comparable in terms of power or price). It turns out, perhaps not surprisingly, that a “best run” may be platform (vendor) dependent. After all, if the platform does not support OpenCL, OpenCL will not win! However, under the transparent API those are details that do not really matter. It is the same code after all. You can write your code once and it will just work, both natively and OpenCL-accelerated, under the control of just an environment variable.
Thanks for your continuing support.
Dr. Harris Gasparakis is AMD’s OpenCV project manager, technical lead, and evangelist. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.
OpenCL and the OpenCL logo are trademarks of Apple Inc., used by permission by Khronos.