This article is part of an occasional series about what developers can do when they collaborate. AMD is a real believer in open source projects. Our developers actively contribute to and maintain a variety of open source projects, from highly optimized math libraries to… well, let’s talk about Blender Cycles.
Blender is a free, open source 3D animation suite. Blender includes the Cycles render engine, which uses ray tracing to convert a 3D model into the 2D image you see on the computer screen. Ray tracing is a very math-intensive process. In fact, it is so math-intensive that it is rarely used in games or other real-time applications. Ray tracing produces very realistic visuals and is typically used when rendering for film, or in other cases where real-time results are not required but high fidelity is. Using the compute capability of a GPU can significantly improve the performance of such renderers.
AMD undertook to improve the support for GPU compute inside Blender Cycles. Prior to this effort, the GPU kernel used for rendering was monolithic and huge. Because of the kernel’s size, the generated code had to spill and unspill registers. These spill/unspill operations slow performance and reduce occupancy. (Occupancy is the actual number of waves running on the GPU simultaneously. More is better.)
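To see why a huge kernel hurts occupancy, consider a minimal sketch of the trade-off. All the numbers below (register file size, wave width, wave cap) are hypothetical round figures for illustration, not the specs of any particular AMD GPU:

```python
# Illustrative only: how per-thread register pressure limits occupancy.
# The hardware numbers here are invented round figures, not real specs.

def waves_per_simd(regs_per_thread, reg_file_per_simd=65536,
                   threads_per_wave=64, max_waves=10):
    """Waves that fit on one SIMD, limited by the register file."""
    regs_per_wave = regs_per_thread * threads_per_wave
    return min(max_waves, reg_file_per_simd // regs_per_wave)

# A huge monolithic kernel needs many registers per thread...
print(waves_per_simd(regs_per_thread=256))  # 4 waves: low occupancy
# ...while small split kernels need far fewer.
print(waves_per_simd(regs_per_thread=32))   # 10 waves: full occupancy
```

When the register file cannot hold enough waves, the compiler spills registers to memory instead, which is exactly the slowdown described above.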
In addition to producing inefficient code, the compiler would sometimes not successfully complete the build, or would generate incorrect code that could lead to black screens or a kernel hang. These are certifiable “bad things.”
At the Blender WIKI page on OpenCL™, you’ll see this: “Cycles was included into blender with the release of 2.61 in December 2011. The release notes mention: ‘… OpenCL, which is intended to support rendering on AMD/ATI graphics cards’. Ever since the support or lack thereof in cycles has been a topic of debate.” For sure.
(Note: it’s a WIKI page. This text might be updated at any time. This is what was there before we did our work!)
There is a Q&A at the end of that WIKI page, and one exchange in particular is intriguing.
“Q: Why don’t you just split up cycles so it can run better on AMD hardware?
“A: While this would likely help it is not a trivial matter to split up cycles in this way. Also it is not clear that it is going to help and how much. As a resource constrained open-source project this will most likely not be a top priority.”
That’s the need that AMD attacked. It is (or perhaps I should say “was”) a decidedly non-trivial task to split the existing monolithic and large kernel. A resource-constrained open source project has to prioritize carefully. So AMD dived in to help.
(Edit for clarification: We didn’t do this entirely on our own, by any means. Like any well-run open source project, there is a commit process. Our submission was reviewed and modified by the Blender community.)
We turned the monolithic kernel into a pipeline of about 10 small kernels that run in sequence. Conceptually, the algorithm works like this: first, move all the required data from the CPU to the GPU. This transfer takes time and is a performance hit, but the computational performance gains outweigh the cost. The smaller kernels then operate on the data, communicating through device memory to avoid much slower round trips to the CPU.
Once the small kernels have processed the data, the data goes back to the CPU. The CPU may then repack the work for the next iteration, so it processes more efficiently. Multiple iterations follow until enough have occurred to provide the image quality desired. The number of iterations varies, depending upon the nature of the scene being rendered, and the quality settings.
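The control flow described above can be sketched as a toy simulation. The stage names (`intersect`, `shade`) and the termination rate are invented for illustration; in the real Cycles code the stages are roughly 10 separate OpenCL kernels sharing state in device memory:

```python
# A toy, hypothetical sketch of split-kernel control flow. Stage names
# and numbers are invented; real Cycles uses ~10 OpenCL kernels that
# communicate through device memory rather than Python functions.

def make_state(num_paths):
    # Stand-in for the GPU-side state buffers.
    return {"alive": num_paths, "samples": 0}

def intersect(state):
    # Placeholder stage: reads/writes the shared state.
    pass

def shade(state):
    # Pretend some fraction of paths terminate each iteration.
    state["alive"] = state["alive"] * 6 // 10

def repack_on_cpu(state):
    # CPU compacts surviving work so the next round runs fuller waves.
    state["samples"] += 1
    return state

def run_split_pipeline(num_paths, iterations):
    state = make_state(num_paths)            # one-time CPU -> GPU upload
    for _ in range(iterations):
        for kernel in (intersect, shade):    # small kernels in sequence
            kernel(state)
        state = repack_on_cpu(state)         # CPU repacks between rounds
    return state

final = run_split_pipeline(num_paths=500, iterations=3)
print(final["alive"], final["samples"])
```

The key design point is the loop shape: many short kernel launches per iteration, with a CPU compaction step between iterations, instead of one enormous kernel that does everything.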
The packing process performed on the CPU is significant. Work from multiple waves that each finish quickly is packed into fewer, fuller waves, increasing utilization. Each iteration becomes more and more efficient, typically following an asymptotic curve: lots of improvement in the second iteration, a bit less in the next, and so on.
For example, assume you have a GPU with the capacity to run 50 waves simultaneously, but 500 waves’ worth of work to process. You start with 10 batches of 50 waves. After the first round, the CPU packs the remaining work and the wave count might drop to, say, 300. Work from waves that ended quickly, leaving compute units idle while they waited for other waves to complete, is repacked into fewer, fuller waves. This means less idle time in the next iteration. Not only that, now there are fewer batches to process.
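The arithmetic of that example is easy to check. This snippet just works through the numbers above (50-wave GPU, 500 waves of work, 300 after repacking):

```python
# Worked version of the batching example above (numbers illustrative).

GPU_CAPACITY = 50   # waves the GPU can run simultaneously

def batches(waves):
    # Batches needed to push `waves` through the GPU (ceiling division).
    return -(-waves // GPU_CAPACITY)

print(batches(500))   # 10 batches in the first round
print(batches(300))   # 6 batches after the CPU repacks the work
```

Fewer, fuller batches each round is where the asymptotic efficiency curve comes from.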
So, what does this look like in the real world? Well, a Beemer seems like a nice car, so we picked that as a model to render. You can get the Blender model here. The model is pretty nice.

UPDATE and Editor’s Note: Based on excellent feedback from our readers, we have added info on discrete GPU performance. The text below reflects this change.
We rendered the image in Figure 1 three ways. Precise system details are below.
- using the CPU for all calculations
- using the integrated graphics capability of an AMD APU
- using an AMD graphics card
Handily, Cycles tells you how long it takes, so no stopwatches were injured during these tests.
So, what difference does it make to enable GPU Compute in the new Blender Cycles?
See for yourself.
For the CPU Only and APU Compute tests, we used an AMD A10 7800B APU. The computer had 8 GB of memory. We were running Windows® 8.1. For the discrete GPU test, we used a Radeon™ HD 7970 (Tahiti).
As noted earlier, ray tracing is very mathematically intense. Without GPU compute, rendering this model took a bit more than 38 minutes. The APU test took 9:38. With the Radeon graphics card, 1:42.
Your mileage may vary. For example, the Radeon graphics card has 32 compute units vs. 8 for the APU, so it is clearly capable of getting through the math faster. As well, there are many options you can set in Cycles (like quality) that will affect how long the render takes. Nonetheless, this gives you a flavor of the kind of speed improvement that GPU compute can provide.
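For the curious, the reported times translate into rough speedup factors like so (the CPU figure is approximate, since the article says “a bit more than 38 minutes”):

```python
# Rough speedup factors from the render times reported above.

def to_seconds(minutes, seconds=0):
    return minutes * 60 + seconds

cpu = to_seconds(38, 0)      # "a bit more than 38 minutes", so ~38:00+
apu = to_seconds(9, 38)      # 9:38 on the APU's integrated graphics
gpu = to_seconds(1, 42)      # 1:42 on the Radeon HD 7970

print(round(cpu / apu, 1))   # APU compute: roughly 4x faster than CPU
print(round(cpu / gpu, 1))   # discrete GPU: roughly 22x faster
```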
These changes to the OpenCL kernel inside Blender Cycles are available in version 2.75. You can get Blender 2.75 here.
So who wins here?
AMD wins. Our culture of supporting open industry standards and open source projects means that software runs better on our hardware. From our perspective, cool tools like Blender make our graphics cards more popular. We are a for-profit company after all. But why not help everyone else along the way?
Users win. Blender users, creative artists—many of them perhaps in the indie community—get a faster tool.
Developers win. Remember that quote from the Blender WIKI about this being non-trivial work? Very true. Somebody needed to do it, and we have the skill and the experience. AMD’s work is now out there in the open source community for you to see and learn from. Not to mention the direct benefit to those who, in their copious spare time, build and maintain Blender. Our hat is off to you; we’re happy to help out.
More generally, we build tools for developers so you can accelerate code by taking advantage of the available GPUs. You can learn about and download the Accelerated Parallel Processing SDK at AMD’s Developer Central website. You can learn a lot at our blog series, OpenCL 2.0 Demystified.
This is a win-win-win scenario. What’s to lose?
Jim Trudeau is Senior Manager for Developer Outreach at AMD. Links to third party sites and references to third party trademarks are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.
Windows is a registered trademark of Microsoft Corporation. OpenCL is a trademark of Apple Inc. used by permission by Khronos. No endorsement of AMD or any of its products by BMW AG is expressed or implied.