Students from Bentley and Northeastern University won the Supercomputing 2013 (SC13) Commodity Cluster Challenge using AMD APUs. We greatly appreciate the opportunity to work with them on the competition and congratulate them on the fantastic results they were able to achieve. The participating students (Neel Shah, Tushar Swamy, Nick Hentschel, Dmitry Veber, and Conner Charlebois) provided the following blog:
Professor David Kaeli first approached us in the summer about entering the Student Cluster Challenge at Supercomputing 2013. At the time, we had no prior experience building or even operating a supercomputing cluster, and little experience assembling and setting up a custom PC. But thanks to the guidance and leadership of our mentors, Professor Kaeli (Northeastern University), Yash Ukidave (Graduate Student at Northeastern University), Professor David Yates (Bentley University), Professor Irv Englander (Bentley University), and Kurt Keville (Massachusetts Institute of Technology), we were able to get up to speed fairly quickly.
After our crash course in Supercomputing 101, we started focusing on the competition. Eleven teams from the US, Germany, China, and Australia were selected to compete in the Student Cluster Competition in Denver, Colorado. Our team, The Open Compute Team, was part of the “Commodity Track,” which limited participating teams to building clusters within a $2,500 USD budget and a 15-amp power limit.
Working closely with our advisors, we first designed a cutting-edge cluster from commercially available parts. We were sure that we wanted a heterogeneous cluster that included both CPUs and GPUs. However, after a cost-benefit analysis, we realized that our budget did not allow for a discrete GPU. So we decided to proceed with AMD’s novel Accelerated Processing Unit (APU) hardware: with both a CPU and a GPU on the same die, AMD’s APU hardware presented a cost-effective solution.
Our final design utilized an AMD A10-6800K APU for our head node and AMD A10-5800K APUs for the seven compute nodes. Two solid-state drives were attached to the head node, while the compute nodes booted disklessly over the network. We noted that the APUs gave us great power efficiency—even more than the discrete GPUs we had tested. Professor Kaeli’s association with AMD allowed us to share our design idea with AMD. We also received a generous financial grant to cover the hardware costs of the cluster.
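A diskless boot like ours is typically driven by a PXE/DHCP/TFTP service running on the head node. As a rough illustration only — the paths, interface name, and addresses below are hypothetical, not our actual configuration — a minimal dnsmasq setup might look like:

```
# /etc/dnsmasq.conf on the head node (hypothetical sketch)
interface=eth1                      # internal cluster network
dhcp-range=10.0.0.10,10.0.0.50,12h  # leases for the compute nodes
dhcp-boot=pxelinux.0                # boot loader served to PXE clients
enable-tftp
tftp-root=/srv/tftp                 # holds pxelinux.0, kernel, initramfs
```

The compute nodes then mount their root filesystem from the head node (commonly over NFS), which is how a cluster can run with storage attached only to the head node.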
AMD’s hardware not only yielded great performance per Watt-dollar but also provided us a great platform to support all of the open-source applications required by the competition—LINPACK (the classic high performance computing metric), WRF (Weather Research and Forecasting), NEMO5 (NanoElectronic Modeling Tools), Graphlab (a data mining and machine learning toolset and API), and OpenFOAM (Open Field Operation And Manipulation).
The LINPACK application relied heavily on the GPU. Initially, we used the clAmdBlas library to offload the BLAS computations to the GPU. The clAmdBlas library is written in OpenCL and has different function prototypes from the standard BLAS library. To translate the LINPACK BLAS calls to clAmdBlas, we wrote wrapper functions that handled context creation, explicit data movement to the GPU, the computation itself, and cleanup of the context. We were the only team in the competition to use OpenCL.
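The four-step pattern those wrappers followed can be sketched as below. This is a hypothetical Python illustration of the translation layer only — the real wrappers were C/OpenCL code calling clAmdBlas; here a NumPy matrix multiply stands in for the GPU GEMM kernel, and `GpuContext` is an invented stand-in for an OpenCL context and command queue:

```python
import numpy as np

class GpuContext:
    """Stand-in for an OpenCL context + command queue (hypothetical)."""
    def __init__(self):
        self.buffers = []          # device buffers we must release later

    def upload(self, host_array):
        # Explicit host -> device copy (clCreateBuffer + clEnqueueWriteBuffer
        # in the real OpenCL code).
        buf = np.array(host_array, copy=True)
        self.buffers.append(buf)
        return buf

    def release(self):
        # Cleanup, mirroring clReleaseMemObject / clReleaseContext.
        self.buffers.clear()

def dgemm_wrapper(A, B):
    """Present a standard BLAS-style dgemm interface while doing the
    OpenCL-style bookkeeping behind the scenes."""
    ctx = GpuContext()                      # 1. context creation
    dA, dB = ctx.upload(A), ctx.upload(B)   # 2. explicit data movement
    C = dA @ dB                             # 3. compute (the GPU GEMM call)
    ctx.release()                           # 4. cleanup
    return C

A = np.arange(4.0).reshape(2, 2)
print(dgemm_wrapper(A, np.eye(2)))  # multiplying by identity returns A
```

The point of the wrapper is that LINPACK keeps calling a familiar BLAS-shaped function, while all the OpenCL setup and teardown is hidden inside it.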
On the eve of the competition, our code produced matrix-size mismatch errors. Time was critical, and we approached the AMD representatives at Supercomputing as well as other AMD employees known to Professor Kaeli.
We appreciate the support of AMD. The AMD employees were quick to reply—even though it was a Sunday!—and pointed us to the right sources for solving the problem. We would like to thank Guy Ludden, Brent Hollingworth, and Vilas Shridharan from AMD for guiding us through this problem and for constantly being in touch.
The clMath team at AMD helped us understand the problem and work around it quickly. In the end, we changed our wrappers to use the clBLAS library, AMD’s open-source OpenCL implementation of BLAS. After a 15-hour grind, we finally got our code to execute. Because we had lost so much time, we could not tune the BLAS library and hence could not deliver our best results for LINPACK. We are confident that tuning the BLAS would have won us the separate LINPACK award for the competition too.
After winning the overall competition for the Commodity Track at Supercomputing, we discovered that the Open Compute Team was one of only two US teams accepted to the Cluster Challenge at the International Supercomputing Conference in Germany. We hope to collaborate with AMD once again and make use of the new FirePro cards for our revamped cluster design.
Read more about these winning students and the competition here: Superstars win at supercomputers