hcInnovator Spotlight
This is where we recognize the individual hcInnovators who have brought OpenCL and Heterogeneous Computing to new heights.
- Douglas Andrade, Aeronautics and Mechanical Engineer at Petrobras
- Vincent Hindriksen, Founder of StreamComputing
- Dr. Guoping Long
- Matthieu Delahaye, Solutions Architect at MulticoreWare Inc.
- Victor Oliveira, Grad student at the University of Campinas (Brazil)
- Daniel Moth, Principal Program Manager, C++ AMP, Microsoft
Douglas Andrade, Aeronautics and Mechanical Engineer at Petrobras

1.Can you tell us a little bit about yourself?
My name is Douglas Andrade, I come from Brazil and I am 28 years old. I founded the CMSoft website where we post numerous OpenCL tutorials and case studies. Professionally, I am an Aeronautics and Mechanical Engineer, and I currently work at Petrobras, Brazil’s largest energy company, developing new technology for large-scale construction and assembly. My main areas of research are high performance computing and parallel processing with OpenCL when applied to office and project productivity, computational math and artificial intelligence.
2.How did you end up working on heterogeneous computing and your specific project?
Back in 2004, when some friends and I first started developing algorithms to address specific problems in the labs, such as uncertainty propagation and graph information retrieval from scanned data, we noticed that the industry was already pointing in the direction of massive parallelism. In 2010, however, along with my CMSoft partners Diego and Edmundo, we finally found the perfect tool to harness the computing power of all CPUs and GPUs: OpenCL. From then on, we really became addicted to how easily OpenCL gave us 10x speedups and, in some cases, how stunning it was to get 300x to 800x speedups. Additionally, the ability to precisely control what to store in each available cache, combined with powerful interoperation between OpenCL and OpenGL, allows smooth visualization of simulations.
3.How do you see OpenCL shaping the industry moving forward in your work?
High performance is critically important in an enormous variety of fields that relate to my work in technology development, such as image processing, information retrieval, artificial intelligence, cloud computing and massive databases. Professionals in these fields have begun to realize how powerful a tool OpenCL is, as we start to see image processing software vendors implementing their algorithms in OpenCL and initiatives to develop OpenCL programmable FPGAs. All these applications, however, are still in the relatively early stages of development so there’s much yet to be done.
In addition, researchers and academic scientists often need big clusters of computers to perform extremely complicated simulations, and this hardware is not always easily accessible. OpenCL enables all researchers to utilize the enormous computing power available in both the CPU and GPU of a desktop computer to properly debug their formulas and software, effectively increasing their productivity. Moreover, when all computers in a cluster are equipped with GPUs, it would come as no surprise to see a performance increase of 100x utilizing cluster computing.
In the energy industry and many others, it is often cheaper and more efficient to train professionals using complicated simulators of flight, boat behavior and welding, to name a few. The mathematical models used in these simulators have sometimes been oversimplified to allow real-time execution on desktop systems, given their limited equipment budget; however, now that OpenCL is available, more realistic simulations can be developed to run in real time, thus saving hundreds of thousands of dollars in training hours and consumables.
4.What are the primary benefits you see in using heterogeneous computing and OpenCL in particular?
First and foremost, heterogeneous computing increases productivity. Heterogeneous computing and OpenCL add immense computing power to devices equipped with multicore processors, which are ubiquitous these days. From the end-user standpoint, the key aspect is that someone who has an eight-core processor probably doesn’t want to use eight applications simultaneously; instead, each application should run much faster. We’re reaching the point where tablets and cell phones will have the same processing power as our current desktop computers just by correctly exploiting heterogeneous computing.
An often overlooked benefit is the possibility to save energy. The industry has been struggling for a long time now pursuing batteries that last longer. It is known that transferring information around from global memory to caches and back consumes a considerable amount of energy and, for the first time, the programmer has the chance to write one single code that works across multiple devices and still control how the data should be distributed across global and local memories and caches. This is all to say that by writing code that targets the GPUs of mobile devices our next generation software will not only run considerably faster but also consume much less power.
5.Do you have any tips or tricks to share with the heterogeneous development community on how to get started with OpenCL programming?
The most essential aspect of OpenCL or any other high-performance environment is to think parallel, i.e., know what parts of your code are independent of each other, and understand the architecture of the devices. The shift from serial to parallel can be challenging and there’s no miracle: studying is necessary.
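As a rough illustration of what "independent parts" means (a sketch written for this page, not code from CMSoft), the Python snippet below contrasts a per-element computation, where each result depends only on its own input, with a running sum, where each step needs the previous one. The function names `scale`, `independent_map` and `running_sum` are made up for this example, and Python threads are used only to show the structure; they are not expected to deliver GPU-like speedups.

```python
from concurrent.futures import ThreadPoolExecutor


def scale(x):
    # Each call depends only on its own input, so calls can run in any order.
    return 2.0 * x + 1.0


def independent_map(values, workers=4):
    # Data-parallel pattern: one unit of work per element, no shared state.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scale, values))


def running_sum(values):
    # Loop-carried dependency: step i needs the result of step i - 1,
    # so this loop cannot be split naively into independent pieces.
    total, out = 0.0, []
    for v in values:
        total += v
        out.append(total)
    return out


data = [0.0, 1.0, 2.0, 3.0]
mapped = independent_map(data)
summed = running_sum(data)
```

The first pattern maps naturally onto one OpenCL work-item per element; the second usually needs a dedicated parallel algorithm (such as a parallel scan) before it can benefit.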
Start with easy tools that allow the developer to quickly create working code instead of trying to understand the very specific setup details (platforms, devices and contexts) at first. These aspects are critically important, but to me it’s much more motivating to start with code that does something much faster. Take a look at existing examples and build your way up to memory caching and coalescing, vectorization, synchronization and atomics.
Professionals interested in simulation and graphics can benefit greatly from incorporating OpenCL and OpenGL interoperability in their software; in this case, it’s necessary to understand OpenGL textures and buffer objects and it’s a good idea to get started by modifying sample code.
Last but not least, read Khronos’ OpenCL specification along with the examples. Tedious as it may seem, remember that this is the one language that can target ANY CPU or GPU, and soon FPGAs and other devices.
6.Where can developers go to learn more about you and/or your work?
The CMSoft website, http://www.cmsoft.com.br, is the best place to find out more about our work with parallel processing using OpenCL. Most of our code is open source.
We have been dedicating considerable effort to using OpenCL and OpenGL interoperation for simulation purposes, and I think the information we at CMSoft post online can be helpful for those who are also going in this direction.
Vincent Hindriksen, Founder of StreamComputing
1.Can you tell us a little bit about yourself?
Hi, I am Vincent Hindriksen, the founder of OpenCL-company StreamComputing. I am 34, Dutch, endurance sporter, GPU hacker, performance engineer, and a promoter of open standards. Most of my paid time goes to OpenCL training, consultancy and coding.
2.How did you end up working on heterogeneous computing?
One moment in 2007 I got bored when working on a project. So I started optimizing the transfer-speed, caching and algorithms, removing unused code, simplifying the bus and avoiding double computations. I doubled the transactions per second. The year after, I got a 6-times speed-up in another project by multi-threading an application. Finally, I brought down a batch-process I had to reverse engineer from 2+ hours to 17 seconds, by rewriting most of the bus-code in my free time. Slowly I got addicted to performance and logically bumped into OpenCL. I immediately saw its potential and the year after I founded StreamComputing.
3.How do you see OpenCL shaping the industry moving forward in your work?
Currently the biggest growth areas are applications using visual algorithms, physics computations, FFT-based solutions and operations on huge matrices, where computation speed has always been the limiting factor. This year I got more interested in Big Data processing and started looking into solutions where, for instance, Hadoop and OpenCL can be used together, as the overlap between Hadoop algorithms and OpenCL algorithms is relatively large. This results in “pre-projects” in which we make sure the data feed is capable of feeding the GPU.
Actually most companies which now need to deal with Big Data haven’t fully figured out yet how they can optimize their data-transfers, and therefore are not ready to start speeding up their algorithms. Luckily it gets less and less accepted if an Excel-sheet outperforms the same computations done at the server.
OpenCL will push the affordable high performance market forward for another year, but it is unknown what high-level language will steal the hearts of developers next year. As long as this language is built on top of OpenCL, we’ll keep this traction of increased freedom for hardware-developers. I’m looking forward to seeing more of OpenCL 2.0 “SPIR” and “HLM”.
4.What are the primary benefits you see in using heterogeneous computing and OpenCL in particular?
Increased performance and more diversity in processors.
Performance, because GPUs can process specific types of algorithms much faster than CPUs are able to do. Now developers can decide to use the GPU for doing compute-intensive tasks and get results in a fraction of the time.
Diversity, because for some years now the market has been getting much more diversified, since we no longer get more and more GHz each year. Note that we would have had a 40GHz processor today if we had not got stuck at 3GHz! I like it a lot that OpenCL is a main enabling technology in this diversification. Magical products like FPGAs and reprogrammable CPUs never took off in a big way, but chances have increased with OpenCL that we will find these technologies as part of future CPUs. Just look at AMD’s APU, which is not the CPU as we know it.
5.Do you have any tips or tricks to share with the heterogeneous development community on how to get started with OpenCL programming?
I think that understanding the new hardware is the most important thing to start with in OpenCL, before even touching code. For example, do you really know the difference between a CPU, a GPU and an APU? Do you know when it is better to use AVX/SSE and when the GPU? How many nanoseconds does it take to get an image ready for processing on the various GPUs?
So I came up with the following steps.
1.Understanding the hardware and architectures.
2.Thinking both in parallel and in vectors.
3.Learning the OpenCL language itself.
4.Profiling and debugging.
These steps also hold when learning another GPU-language. During the trainings I gave, I noticed that people expected to start with step 3, as they had done when learning another CPU-language.
6.Where can developers go to learn more about you and/or your work?
Just wander around the website http://www.streamcomputing.eu – there are currently 100 blog-posts to explore, with loads of information on OpenCL, GPGPU and various other subjects. I answer all e-mails I get via http://www.streamcomputing.eu/solutions/ask-your-question/ personally.
Dr. Guoping Long
for his leadership role in the OpenCV-OpenCL project and his translation of the Introduction to OpenCL™ Programming textbook into Chinese.
1.Can you tell us a little bit about yourself?
My name is Guoping Long. I am now wrapping up my one-year visit here at the Archlab of the computer science department at UCSB. Next I will go back to China to continue my work at ISCAS. My major technical interests are computer architecture and parallelization techniques for various applications. One important on-going project of our team since 2010 is an optimized OpenCL™ version of the OpenCV library.
2.How did you end up working on heterogeneous computing?
I started wrestling with parallel programming abstractions in 2007, when I was obsessed with the shared memory semantics of multi-threaded programs. Starting from there, I built up my understanding of the intriguing interaction between more and more sophisticated parallel hardware and more and more demanding application requirements. The growing importance of GPU/Fusion architectures in the HPC community is a concrete embodiment of the evolution of this increasingly challenging interaction between software and parallel hardware.
Although I was fascinated by the GPU architecture and its programming implications much earlier, my actual work on heterogeneous computing did not begin until the end of 2010, when the ISCAS-AMD Fusion Software Center initiated a joint effort to contribute an optimized OpenCL version of the OpenCV library. While this is an on-going project, and still far from complete, we did learn a lot of things throughout this process.
3.How do you see OpenCL shaping the industry moving forward?
Due to energy constraints, in the foreseeable future it seems to be a practically effective design bias to employ a certain amount of heterogeneity or specialization to fully exploit the abundant transistor resources. Energy efficient heterogeneous core designs, combined with an ever increasing number of cores, have many subtle implications for software. OpenCL is a promising programming abstraction designed to address such challenges.
As an open industry standard, OpenCL by definition is portable across multiple hardware platforms, and encourages diversified, vendor-specific, high quality implementations. This is useful to fully realize the potential of unique architecture-specific features. More importantly, hardware/software vendors (AMD, Apple, etc.) are working consistently to promote OpenCL to industry.
From our experience with hardware-specific optimizations, it would not be objective, so far, to consider OpenCL as user friendly as CUDA. But the OpenCL language extension is constantly improving. One notable example is the C++ template support. Looking into the future, I think perseverance is most critical for OpenCL to succeed in shaping the industry, that is, firmly embracing the openness of OpenCL, strenuously improving the user experience, and persistently attracting more and more programmers to develop cool OpenCL based applications.
4.What are the primary benefits you see in using heterogeneous computing and OpenCL in particular?
Well, hardware heterogeneity actually makes programming even harder, which to some extent further exacerbates the already hard problem of sequential programming (correctness/security verification, complexity management of gigantic software projects, etc.). One important reason to embrace heterogeneous computing is the hard reality of the hardware design trend, namely, to employ heterogeneity to reach an acceptable performance and energy balance. Given this reality, programming with hardware asymmetry in mind can greatly help fully realize hardware potential.
As an open standard, one of the primary benefits of OpenCL is its openness and portability. While this does not necessarily mean performance portability in all cases, there does exist common ground for optimized code across conceptually similar platforms.
5.Do you have any tips or tricks to share with the heterogeneous development community on how to get started with OpenCL programming?
OpenCL programming is no magic. Practice makes perfect. For beginners, a nice place to look at is the AMD OpenCL zone. The AMD OpenCL SDK also contains lots of nice examples to start with. For experienced programmers seeking optimization tricks, a meticulous study of the hardware architecture will be very helpful. There are also many research papers published each year on various optimization tricks. For people who are interested in the OpenCL implementation itself, the Clang/LLVM infrastructure contains a prototype implementation.
6.Where can developers go to learn more about you and/or your work?
Our OpenCL work mainly focuses on the OpenCV library. To access our code, please retrieve the most recent trunk of the library. There are a bunch of optimized functions based on OpenCL now. Please let us know your comments or report bugs. This project is still going on.
Matthieu Delahaye, Solutions Architect at MulticoreWare Inc.

1.Can you tell us a little bit about yourself?
My name is Matthieu Delahaye. I joined MulticoreWare in 2011, where my principal occupation is to support the development of new compiler technologies related to GPU compute. Before joining MulticoreWare, I was a Research Engineer at the University of Illinois, where I was involved in different projects, always focusing on increasing software performance.
2.How did you end up working on heterogeneous computing?
It was a natural transition in my career. I first started working on single-core performance. When multicore CPUs came, my work focused on parallel computing on multicore systems. When I heard about GPGPU, I was naturally interested and wrote a few pixel shaders in my free time. Then the project I was working on ended and it was time for me to move to a new project. My next project was to write a static analyzer for GPU code to predict the performance of a kernel on a GPU. I did not have any other choice than learning more about it. This was about 4 years ago, and all the subsequent projects I have been involved in have been related to GPU computing.
3.How do you see OpenCL shaping the industry moving forward in your work?
While some may dispute that OpenCL is the ultimate programming platform for heterogeneous computing, it is difficult to ignore that as of today OpenCL is the de facto meeting point between hardware and software. More and more hardware vendors are joining the OpenCL bandwagon or are planning to, and OpenCL is the first technology that software vendors have in mind when performance matters. I will not forecast whether OpenCL will hold that same place in a couple of years, but I am confident that we will continue to use the same execution model. As for my work, I think this will keep me busy for at least a decade.
4.What are the primary benefits you see in using heterogeneous computing and OpenCL in particular?
Obviously, performance is the first that comes to my mind. Portability is another one, even if the performance of OpenCL varies greatly across the different platforms that support it. With the emergence of smartphones, it is more and more difficult to target a software product at a single operating system on a single architecture. Each platform may dictate the language used, but OpenCL may play an important role in providing a consistent development language and API for performance-critical code.
5. Do you have any tips or tricks to share with the heterogeneous development community on how to get started with OpenCL programming?
A tip: An OpenCL optimization that was working well a few years ago may be wrong today. So if you pick up a trick online, make sure it still applies today for your card model. That could save you from a useless and time-consuming code refactoring.
Victor Oliveira, Grad student at the University of Campinas (Brazil)

1.What made you interested in Image Processing and Computer Vision, and what prompted your move to biomedical imaging and machine learning?
I first got involved in Robotics as an undergrad research project, but I soon realized I liked the software part of the work more, so it was a natural move to get interested in Computer Vision and Image Processing. My grad school advisor works with Biomedical Imaging, and it’s great being able to help people. I was also a guinea pig in many imaging experiments in the University’s hospital, so I have many pictures of my brain.
It’s a current trend to use Machine Learning in problems that are hard to formalize, which is usual in Image Processing (besides, ML is pretty cool).
2.Can you tell us a little more about your thesis?
I did some work in Biomedical Imaging which helped my lab colleagues, especially with some legacy code which was very slow and which I optimized to use multi-threading and the GPU (first CUDA, now OpenCL). But after working that much in Image Manipulation and GPUs (GPU support in GIMP), my thesis will probably be a guide for my colleagues in the imaging field, who can get very confused in this new world of many architectures!
3.Did you find it easy to learn and implement OpenCL?
I don’t think the OpenCL API in itself is hard; in fact it is much cleaner than others like OpenGL and CUDA (my opinion). Things can get hairy when you have to integrate OpenCL into an existing application which doesn’t take performance into account, especially when data processing is split across many functions and you have to put all this in a kernel. Another thing that makes development hard is error handling: an OpenCL call is not guaranteed to be successful (but I can’t figure out a better way to do it in C).
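To make the error-handling point concrete, here is a small Python sketch (written for this page, not taken from Victor's code) of the usual pattern: every call returns a status code, and a helper turns anything other than `CL_SUCCESS` into a descriptive error. The numeric values match the constants in the OpenCL headers; the `OpenCLError` class and the `check` helper are hypothetical names used only for illustration, and the status values passed in are plain integers rather than results of real API calls.

```python
# A few status codes as defined in the OpenCL headers (cl.h).
CL_SUCCESS = 0
CL_ERROR_NAMES = {
    -1: "CL_DEVICE_NOT_FOUND",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -30: "CL_INVALID_VALUE",
}


class OpenCLError(RuntimeError):
    """Hypothetical exception type used only in this sketch."""


def check(status, what):
    # Turn a non-zero status code into an exception naming the failed call.
    if status != CL_SUCCESS:
        name = CL_ERROR_NAMES.get(status, "unknown error")
        raise OpenCLError(f"{what} failed with {name} ({status})")
    return status
```

In C the same idea is typically written as a macro or small function wrapped around each call, which is tedious but keeps failures from passing silently.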
4.How do you determine if an algorithm is a good fit for GPU processing?
In this order of priority:
1.The problem must be data-parallel, i.e. we should be able to decompose the problem into small parts that can be solved independently.
2.Its memory access should be non-random. For example, Union-Find algorithms perform badly because they have random accesses when merging the structure.
3.In general, this is a good formula for each work-item: (number of arithmetic operations) / (number of loads and stores in global memory). Memory accesses (even coalesced) are much slower than floating-point operations, so the bigger this ratio, the better.
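The ratio described in this answer can be written as a small helper. The Python sketch below is an illustration only: the function name `arithmetic_intensity` and the per-work-item operation counts are assumptions chosen for the example, not measurements from any real kernel.

```python
def arithmetic_intensity(arithmetic_ops, global_memory_ops):
    # Ratio from the text: arithmetic operations per global load/store,
    # counted for a single work-item.
    if global_memory_ops <= 0:
        raise ValueError("expected at least one global memory access")
    return arithmetic_ops / global_memory_ops


# Hypothetical per-work-item counts, chosen only for illustration.
# Vector add c[i] = a[i] + b[i]: 1 addition, 2 loads and 1 store.
vector_add = arithmetic_intensity(arithmetic_ops=1, global_memory_ops=3)
# A made-up compute-heavy kernel: 60 operations on 3 loads and 1 store.
compute_heavy = arithmetic_intensity(arithmetic_ops=60, global_memory_ops=4)
```

By this rule of thumb, the second kernel is the more promising GPU candidate, since it does far more arithmetic per global memory access than the vector add.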
5.How is the OpenCL performance?
Great, it’s surprising that I can get good results even on the CPU. There is some real magic happening there.
6.What are your favorite tools for OpenCL development?
I think the AMD APP Profiler is a great tool that I have yet to master completely.
Linux:
GCC, Vim, AMD APP Profiler, OProfile.
Windows:
MinGW, Notepad++, AMD APP Profiler, AMD Kernel Analyzer.
That’s it!
Thanks,
Victor Oliveira
Daniel Moth, Principal Program Manager, C++ AMP, Microsoft

1.Can you tell us a little bit about yourself?
My name is Daniel Moth. I started with Microsoft in the UK in 2006, working on the Developer and Platform team. I moved to the US in 2008 to join the Parallel Computing team in Visual Studio. It was like joining a startup company within a company, and we incubated all kinds of parallelism-related technologies – some that shipped and some that didn’t. Outside of work I enjoy playing chess and, when I can get away from Seattle, SCUBA diving with my wife.
2.How did you end up working on C++ AMP and heterogeneous computing?
In the Parallel Computing team, my focus for the first couple of years was parallel debugging both for .NET and native technologies, and that shipped with Visual Studio 2010 (any fans of the Parallel Tasks and Parallel Stacks windows?). In the last couple of years, for Visual Studio 11, my ownership expanded to include GPU debugging, for a new approach that nobody outside of Microsoft had seen (and that we now know as C++ AMP). As time went on, I guess the programming model side of things needed me more than the tooling side, so I made the short side step to be the Program Manager for C++ AMP, rather than for parallel and GPU debugging in Visual Studio. I have been very lucky because in both sub-teams of the same broader team, I worked with individuals that are much smarter than me, so every day I learn and grow from interacting with them – that for me counts more than the technology I work on.
3.How do you see C++ AMP shaping the industry moving forward?
In a way, C++ AMP will not shape the industry, but will rather emerge as the de facto programming model choice, in my biased view. The hardware industry is definitely still in flux, and the capabilities, differentiation, and convergence of hardware are still in motion. Microsoft has partnerships with all major hardware vendors, so we are uniquely positioned to propose a solution that we think will transcend time, and we received valuable feedback on it, including from AMD. It is much easier to work with a programming model that was designed from the outset to cater for the hardware evolution that we are witnessing, rather than to pick up an API that gets continuous patches to expose (and tie itself) more and more to today’s latest hardware release.
4.What are the primary benefits you see in using C++ AMP instead of DirectCompute or OpenCL or CUDA?
I believe that C++ AMP has a truly future-proof design. It is the most sensible strategic bet a company can make as they look for a heterogeneous programming solution that satisfies their GPU requirements today, while protecting their investments for tomorrow’s hardware evolution. If you leave that aside and instead just focus on today, then you’ll find that C++ AMP lowers the barrier to entry for developers to take advantage of accelerators such as GPUs, and is a much more productive environment to code in than any of the alternatives, utilizing modern C++ constructs, while still offering hardware portability that is second to none (and the potential of platform portability through the open specification).
5.How should developers stay up to date on C++ AMP and learn more?
There are many options for developers to learn C++ AMP today, including articles, screencasts, recorded presentations & slides, learning guides for developers familiar with CUDA/OpenCL/DirectCompute, many samples, a book by Kate Gregory and Ade Miller, as well as a deep dive training course by Acceleware. For the hardcore folks, you can download the 130 page C++ AMP open specification document and read it cover to cover. After all that if you have follow up questions, we respond to forum questions within 24 hours, which I am told is better than most paid support services. The answer to “Where can I find links to all those?” is the same as the answer to “How can I stay up to date?” – visit and subscribe to our team blog: http://blogs.msdn.com/b/nativeconcurrency/