AMD recently released the Open64 Compiler Suite version 4.2.2, bringing new and advanced capabilities to x86 software developers. The Suite includes tools for compiling C/C++ or Fortran code into machine code that is optimized for AMD and other x86 platforms. By including optimizations that can automatically target the microarchitecture capabilities of a given system, the Suite helps empower developers to create blazing fast software with minimal effort. This article will discuss the history and background of the Open64 Suite, look at what a user can expect to get with this Suite of tools, and then dive into some examples of how the Suite might be used in practice to optimally compile code for execution on AMD platforms.
Open64 is the result of research work done by compiler groups around the world. It was initially created by Silicon Graphics Inc. and called Pro64. It was released under the GNU General Public License (GPL) in 2000. The maturity and flexibility of the Open64 Compiler Suite are demonstrated by the fact that it has been retargeted for a number of architectures beyond the original MIPS-based platforms, including x86 and x86-64 processor architectures.
Open64 leverages various aspects of the GNU Compiler Collection (GCC), which is standard in virtually all Linux® environments, and it fully supports the C/C++ and Fortran programming languages. In addition, Open64 currently has preview-level support for the shared memory programming model OpenMP, which facilitates application development that targets multiprocessor environments. OpenMP provides a simple interface for writing applications with parallel code through a set of compiler directives, library routines, and environment variables. In today’s environment where multicore processors are the norm, Open64 is a fundamental tool for all serious x86 developers.
What is the Open64 Compiler Suite?
The Open64 Compiler Suite includes all of the tools necessary to compile high-level C/C++ or Fortran code into machine code that performs optimally on a given x86 platform. These tools can be directed to optimize code for specific microarchitectures that may have features or cache sizes that are not available on previous processor generations. For example, a later microarchitecture may support newer instructions, such as additional SSE instructions, or it may have more on-die cache for an application to leverage. If the compiler is unaware of what hardware features are available, it cannot generate code that takes full advantage of that platform. With Open64's advanced tools, developers can tune their code to achieve maximum performance on any x86-based system.
There are several new specific features that have been added to Open64. Let's take a look and see how you can take advantage of them.
2 MB Huge Pages
Open64 now includes support for 2 MB (2^21 bytes) huge pages. This is an important feature because it enhances performance for applications with large data sets. For many years, the typical page size was 4 KB, and this is still the default today. However, as system sizes have scaled from one or two to a dozen or more processor cores, memory sizes have kept pace and have grown from hundreds of megabytes to tens of gigabytes. As this integration progressed, applications have evolved to take advantage of the additional processing power and memory size. With large memory footprints becoming common, 4 KB pages are simply too small: the translation lookaside buffer (TLB) has a limited number of entries, and at 4 KB per entry it cannot map a large memory footprint all at once. In response to these trends, processors implemented “huge” page sizes to reduce pressure on the limited number of TLB entries available. While packaged applications with large data sets such as databases have leveraged huge pages for several years, Open64’s support for huge pages makes this feature accessible to all application developers. This support enables the processor’s resources to be utilized effectively on an application-by-application basis, with a mix of small and huge pages as appropriate.
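To put rough numbers on the TLB argument, the following sketch counts how many page mappings a given memory footprint requires at each page size. The function name pages_needed is made up for this illustration and is not part of Open64 or any system API.

```c
#include <stddef.h>

/* Number of pages required to map `bytes` of memory when each page
 * holds `page_size` bytes. Rounds up so a partial page still counts.
 * Hypothetical helper for illustration only. */
size_t pages_needed(size_t bytes, size_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}
```

For a 1 GB (2^30 byte) data set, pages_needed reports 262,144 mappings with 4 KB pages but only 512 with 2 MB pages, which is why far fewer TLB entries are needed to cover the same footprint.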
Interprocedural Analysis
Interprocedural analysis (IPA) is another of the major improvements in the latest Open64 Compiler Suite. IPA will analyze code across function call boundaries, even across multiple files if necessary, in an attempt to improve overall performance. For example, it will look for opportunities for inlining functions to reduce function call overhead. Further, it will perform a technique called partial inlining where a function and its caller can be reorganized such that the code can avoid unnecessary calling of the function. IPA will also analyze whether any benefit can be gained from reordering structure members to gain better memory layout or locality. Finally, IPA is an optimization that not only provides benefits in and of itself, but code that has been inlined can often be further exposed to other optimizations that would not otherwise be possible, such as loop optimizations. Listing 1 includes some sample code to demonstrate how IPA can optimize an application.
Listing 1: Interprocedural Analysis Example Code
#include <stdio.h>

/* Declare factorial() before main() uses it. */
int factorial(int n);

int main()
{
    printf("%d\n", factorial(0));
    return 0;
}

int factorial(int n)
{
    int answer, i;
    if (n < 0)
    {
        printf("Error: input number %d is negative!\n", n);
        return -1;
    }
    else if (n == 0 || n == 1)
    {
        return 1;
    }
    else
    {
        answer = 1;
        for (i = n; i > 1; i--)
        {
            answer *= i;
        }
        return answer;
    }
}
In the code in Listing 1, IPA can inline the call to factorial(), exposing the argument (0) to the function, allowing the optimizer to directly assign the result of the call to the caller. We compile the code first with the -O3 optimization (IPA is not included as part of -O3):
% opencc -O3 example2.c
Even with the -O3 global optimization turned on, the following call is still made from the main function:
callq 40058e <factorial>
However, when compiled with the -IPA switch, the compiler is able to optimize away the function call and simply put the result in the destination register:
% opencc -O3 example2.c -IPA
An analysis of the compiler output shows that the call instruction is replaced by the following move instruction as shown here:
mov $0x1,%rsi
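Partial inlining, described above, can be pictured with a hand-written before-and-after pair. This is only a sketch of the idea, not Open64's actual output; the function names sum_values and caller_after_partial_inlining are invented for the illustration.

```c
#include <stddef.h>

/* Original routine: even the trivial empty-input case pays for a
 * full function call before returning. */
long sum_values(const int *values, size_t count)
{
    long total = 0;
    size_t i;

    if (values == NULL || count == 0)
        return 0;                   /* cheap early exit */
    for (i = 0; i < count; i++)
        total += values[i];
    return total;
}

/* Hand-written picture of a caller after partial inlining: only the
 * cheap early-exit test is copied into the caller, so the call is
 * made just when there is real work to do. */
long caller_after_partial_inlining(const int *values, size_t count)
{
    if (values == NULL || count == 0)
        return 0;
    return sum_values(values, count);
}
```

The loop body stays out of line, so code size grows only by the size of the test, while callers that frequently pass empty input avoid the call entirely.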
Another improvement in Open64 to note is that data structures are more aggressively optimized to improve how they are accessed in an application. To accomplish this, Open64 will analyze structures and how their various members are used in the application and reorder those members so that frequently accessed items are close together, improving memory reference locality. This improves performance by reducing cache misses. Open64 is even capable of performing these optimizations on complex data compositions such as arrays of structures.
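The effect of member reordering can be sketched by hand. The two structure definitions below are hypothetical; they simply contrast a layout in which two frequently used members are separated by a rarely used buffer with a layout in which they sit side by side. The helper functions report the distance in bytes between the two hot members.

```c
#include <stddef.h>

/* Hypothetical record as a programmer might declare it: key and hits
 * are used on every lookup, but a large, rarely used buffer sits
 * between them. */
struct record_as_written {
    int  key;
    char description[256];
    int  hits;
};

/* The same members regrouped by hand so the hot members are adjacent. */
struct record_reordered {
    int  key;
    int  hits;
    char description[256];
};

/* Byte distance between the two hot members in the original layout. */
size_t hot_gap_as_written(void)
{
    return offsetof(struct record_as_written, hits)
         - offsetof(struct record_as_written, key);
}

/* Byte distance between the two hot members in the reordered layout. */
size_t hot_gap_reordered(void)
{
    return offsetof(struct record_reordered, hits)
         - offsetof(struct record_reordered, key);
}
```

Assuming a 4-byte int, as on typical x86 Linux systems, the hot members are 260 bytes apart in the first layout and 4 bytes apart in the second, so in the second layout they are much more likely to share a cache line.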
Loop Nest Optimizations
The Open64 Compiler Suite offers a powerful set of optimizations targeted to improve the performance of loops. These loop nest optimizations (LNO) include loop interchange, loop blocking, loop unrolling, loop fusion, and loop fission, among others. The following shows an example of how Open64’s LNO can optimize a loop nest.
In Listing 2, we have a snippet of a couple of nested loops that perform some simple arithmetic in the inner part of the loops.
Listing 2: Loop Nest Optimization Example Code
...
float a[10000][10000], b[10000][10000];
...
for (i = 0; i < 9999; i++) {
    for (j = 0; j < 9999; j++) {
        a[j][i] = i+j;
    }
}
/* Start at i = 1 so that a[j][i-1] stays inside the array. */
for (i = 1; i < 9999; i++) {
    for (j = 0; j < 9999; j++) {
        b[j][i] = (a[j][i-1] + a[j][i] + a[j][i+1]) / 3.0;
    }
}
...
The compilation switch “-O3” turns on LNO. LNO analyzes the access patterns of the arrays a and b in the above example and applies the loop interchange optimization to the loop nests, effectively swapping the i and j loops. As a result of this interchange the access patterns are much more cache-friendly.
Application programs with loops similar to the above code snippet may see their run time performance speed up dramatically when compiled with Open64’s loop nest optimizations.
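The interchange applied to the first loop nest of Listing 2 can be written out by hand as follows. This is a sketch of the transformation, not compiler output; the array size N is reduced to 64 for illustration, and the function names are invented.

```c
#define N 64   /* small illustrative size; Listing 2 uses 10000 */

/* Loop order as written in Listing 2: the inner loop walks down a
 * column, so consecutive iterations touch addresses N floats apart. */
void fill_column_order(float a[N][N])
{
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[j][i] = (float)(i + j);
}

/* After interchanging the loops by hand: the inner loop walks along
 * a row, so consecutive iterations touch adjacent addresses. */
void fill_row_order(float a[N][N])
{
    int i, j;
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            a[j][i] = (float)(i + j);
}
```

Both functions store the same values; only the order of the memory accesses differs. Because C stores arrays row by row, the second version uses every element of a cache line before moving to the next line.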
Open64 also improves upon a couple of classic compiler optimizations called loop unrolling and loop fusion. Loop fusion is an optimization that attempts to improve performance by combining multiple loops into a single loop. The idea here is to decrease loop overhead by performing more work per loop. Loop unrolling has a similar goal in that it attempts to reduce loop overhead by expanding the loop body to contain multiple iterations of the loop. As such, the overhead of repeatedly checking the loop condition is reduced. It should be noted that both of these techniques have the potential to degrade performance if used blindly, but Open64 can avoid this issue because it is cognizant of the cache and internal register implications of such transformations. The latest version of Open64 improves loop fusion through better reuse of array elements resulting in superior cache utilization. In addition, loop unrolling can now kick in for a loop whose body spans multiple basic blocks to provide even better optimized loops.
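Both transformations can be illustrated by hand. The functions below are a sketch with invented names, shown only to make the ideas concrete; the compiler applies the equivalent rewrites internally when it judges them profitable.

```c
/* Two separate loops over the same range. */
void scale_then_offset_separate(int *out, const int *in, int n)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = in[i] * 2;
    for (i = 0; i < n; i++)
        out[i] = out[i] + 1;
}

/* The same work fused by hand into one loop: a single pass over the
 * data and a single set of loop-control instructions. */
void scale_then_offset_fused(int *out, const int *in, int n)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = in[i] * 2 + 1;
}

/* A summation unrolled by hand by a factor of four. The second loop
 * handles the leftover elements when n is not a multiple of four. */
long sum_unrolled_by_4(const int *in, int n)
{
    long total = 0;
    int i = 0;
    for (; i + 3 < n; i += 4)
        total += (long)in[i] + in[i + 1] + in[i + 2] + in[i + 3];
    for (; i < n; i++)
        total += in[i];
    return total;
}
```

In the unrolled loop the loop condition is evaluated once per four elements instead of once per element, which is the overhead reduction described above. The larger loop body is also the reason unrolling can hurt if applied without regard to cache and register limits.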
Head/Tail Duplication and If-Merging
Another class of improvements in Open64 includes scalar replacement, constant folding, head/tail duplication, and if-merging optimizations. The goal of scalar replacement is to reduce memory references. This is achieved by replacing array references with scalar references. Constant folding is an optimization that simplifies constant expressions at compile time, resulting in faster code at runtime. If-merging and head/tail duplication are techniques whereby code is reorganized to bring loops closer together in order to enable the possibility of applying loop optimizations on them.
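Scalar replacement can be sketched by hand as shown below. The function names and the row-major matrix layout are assumptions made for this illustration only.

```c
/* As written: row_sums[i] is an array reference inside the inner
 * loop, so it may be loaded and stored on every iteration. */
void row_sums_array_ref(int rows, int cols, const int *m, long *row_sums)
{
    int i, j;
    for (i = 0; i < rows; i++) {
        row_sums[i] = 0;
        for (j = 0; j < cols; j++)
            row_sums[i] += m[i * cols + j];
    }
}

/* After scalar replacement by hand: the running total lives in a
 * local scalar (which can stay in a register) and is stored to the
 * array once per row. */
void row_sums_scalar(int rows, int cols, const int *m, long *row_sums)
{
    int i, j;
    for (i = 0; i < rows; i++) {
        long total = 0;
        for (j = 0; j < cols; j++)
            total += m[i * cols + j];
        row_sums[i] = total;
    }
}
```

A compiler can make this change on its own only when it can show that the stores to row_sums do not overlap the data read through m; writing the scalar form explicitly removes that question.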
Many optimizations will perform best if the compiler is given a hint about the target microarchitecture through the -march switch. This way it can optimize the code to use all of the available processor cache, the latest supported instruction set, and any other hardware features that might improve code performance. This switch is also important for taking advantage of optimizations like instruction selection and instruction scheduling.
Instruction selection involves picking the optimal set of instructions to carry out the intended operation. This requires knowledge about the set of supported instructions, register count, pipeline information, and so on. Instruction scheduling optimizations require similar knowledge about the target hardware in order to reorder instructions so that the code runs faster. Of course, both of these optimizations are valid only if they preserve the original meaning of the code. Furthermore, indicating which microarchitecture to target may cause the compiler to link in library code that has been optimized for the target platform, giving further performance advantages to the application. For complete details on how to specify a particular microarchitecture, refer to the user guide linked at the end of this article. For example, to compile for the latest AMD “Barcelona” processor, your command line might look something like this:
% opencc -O3 -march=barcelona code.c
By now you should be ready to install the Open64 Suite in your own environment, so let's look at a few examples of how to use Open64. First you'll need to download and install the Suite on your system. Refer to the Open64 website (see Resources) for details on how to download and configure Open64 for your needs. Once it is set up, the following examples will help in getting things off the ground. A complete user guide is available on AMD's site as well (see Resources).
Initial Compiler Test-Drive
The first thing to try will be to invoke the compiler to ensure proper installation. This is done with a simple call to the appropriate compiler: opencc for C, openCC for C++, or openf90 for Fortran. The command will look something like this (assuming these binaries are in your path):
% opencc <input C files>
% openCC <input C++ files>
% openf90 <input Fortran files>
Of course, there are plenty of switches to play with that will tune your code in various ways. Here are examples of some basic switches of interest. To turn on all global optimizations and take an aggressive approach to optimizing your code, the -O3 option can be used as follows.
% opencc -O3 <input src files>
Note that the -O3 option will cause the compiler to not worry so much about compile time. Instead it will focus on applying the most optimizations possible. Note that global optimizations don't include the interprocedural analysis discussed earlier in Listing 1. Here's an example command line with IPA enabled:
% opencc -O3 -IPA <input src files>
Now consider a program with a lot of math operations, floating point operations in particular. It's possible you want them to operate as quickly as possible and are not too concerned about strict adherence to ANSI/ISO or IEEE standards. Open64 provides a switch to optimize such math operations:
% opencc -ffast-math <input src files>
Alternatively, it may be desirable to strictly adhere to these standards, in which case the opposite switch can be applied:
% opencc -fno-fast-math <input src files>
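To see why relaxed floating-point rules can change results, consider the sketch below. It assumes that one of the liberties such a switch may grant is regrouping additions; the exact set of transformations Open64 performs is documented in the user guide, not here. The function names are invented.

```c
/* Evaluates (a + b) + c in the order the source specifies. */
double add_left_to_right(double a, double b, double c)
{
    return (a + b) + c;
}

/* Evaluates a + (b + c): the kind of regrouping a compiler might
 * choose if allowed to treat floating-point addition as associative. */
double add_regrouped(double a, double b, double c)
{
    return a + (b + c);
}
```

With IEEE double-precision arithmetic and the inputs 1e20, -1e20, and 1.0, the first function returns 1.0 while the second returns 0.0, because 1.0 is lost when it is added to -1e20. Code that depends on the written order of evaluation is therefore a candidate for strict compilation.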
Finally, if you just want to let Open64 optimize for speed, then the -Ofast switch can be used. This switch is the equivalent of applying -O3, -IPA, -OPT:Ofast, -fno-math-errno, and -ffast-math together:
% opencc -Ofast <input src files>
There are literally hundreds of switches available for controlling how Open64 compiles your code. We'll look at only a few here, so check out the user guide for complete details on each one. If all the optimizations are turned off, Open64 will attempt to reduce the cost of compilation and produce code that can easily be stepped through in a debugging session. As the level and number of optimizations increase, the resulting code runs faster but becomes more difficult to debug.
Global Optimizations
First let's take a look at a few global optimizations and how to control them. Similar to the -O3 switch we've seen already, there are also -O0, -O1, and -O2 switches available. Progressing from -O0 to -O3 indicates to the compiler to perform progressively more and more aggressive optimizations designed to increase code execution speed. On the other hand if you'd like to reduce code size, then there is an -Os switch available that will perform a subset of -O2 optimizations, excluding those that tend to increase code size, as well as some additional optimizations aimed at size reduction. Another global optimization of interest is the -apo switch, which tells Open64 to transform sequential code into parallel code when a speedup can be gained on a multiprocessor target. Consult the user guide for details on how to explicitly tell the compiler to parallelize specific blocks using OpenMP directives. Here we use a couple of switches to add speed optimizations and attempt to parallelize code:
% opencc -O2 -apo <input src files>
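The kind of loop that lends itself to parallelization, and the explicit OpenMP form of the same request, might look like the sketch below. The function name scaled_add is invented; the directive is the standard OpenMP parallel for construct, and the compiler switch that enables OpenMP processing in Open64 should be taken from the user guide.

```c
/* Computes y[i] = a * x[i] + y[i] for every element. Each iteration
 * reads and writes only its own element, so iterations can run in
 * any order or at the same time, provided x and y do not partially
 * overlap. */
void scaled_add(int n, float a, const float *x, float *y)
{
    int i;
    /* Explicit request to split the iterations across threads. When
     * OpenMP processing is not enabled, the directive is ignored and
     * the loop simply runs sequentially. */
#pragma omp parallel for
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Loops that carry a dependence from one iteration to the next, such as a running sum stored back into the array, do not fit this pattern without additional directives.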
Next, let's look at some interprocedural optimizations such as inlining, constant propagation, and field reordering for structures. Open64 offers fine-grained control over what is inlined. There are switches to inline nothing at all, inline everything, or something in between. To inline everything try this switch (careful, code size can increase significantly!):
% opencc -O2 -INLINE:all <input src files>
Or to back off a little bit but still aggressively inline, try this:
% opencc -O2 -INLINE:aggressive=ON
Or if you really don’t want to get too wild but still want to optimize simple functions:
% opencc -O2 -finline-functions
Finally, if you just want to prevent all IPA inlining from happening, use this switch:
% opencc -IPA -IPA:inline=OFF
Next, IPA constant propagation replaces formal parameters by their corresponding constant values, eliminating the need to pass those parameters. As with inlining, different levels of aggressiveness can be selected. Aggressive constant propagation is enabled like this:
% opencc -IPA -IPA:aggr_cprop=ON
The default level of constant propagation can be explicitly enabled with this switch:
% opencc -IPA -IPA:cprop=ON
Or you may find that you want to disable constant propagation altogether:
% opencc -IPA -IPA:cprop=OFF
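The effect of interprocedural constant propagation can be pictured by hand. In this sketch, with invented function names, suppose whole-program analysis finds that every call site passes 1 for the stride parameter; the routine can then be specialized so that the parameter disappears.

```c
/* General routine: stride is a formal parameter that must be passed
 * on every call and read on every iteration. */
long sum_strided(const int *data, int n, int stride)
{
    long total = 0;
    int i;
    for (i = 0; i < n; i += stride)
        total += data[i];
    return total;
}

/* Hand-written picture of the routine after the constant 1 has been
 * propagated into it: one fewer argument to pass, and the loop steps
 * by a known constant. */
long sum_strided_stride1(const int *data, int n)
{
    long total = 0;
    int i;
    for (i = 0; i < n; i++)
        total += data[i];
    return total;
}
```

Once the step is a known constant, the loop also becomes a simpler target for other optimizations such as unrolling.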
Next, field reordering will try to better arrange members of large structures based on reference patterns obtained during feedback compilation (refer to the user guide for details on feedback compilation). The intent is to minimize data cache misses. Here’s how to enable it since it is off by default:
% opencc -IPA -IPA:field_reorder=ON
Finally, let's look at how to use some loop nest optimization (LNO) switches. First note that at least -O3 must be specified in order to enable these optimizations. Loop nest optimizations can provide tremendous performance gains to your code, but can also increase code size. As a result, these switches should be used with great care. A general switch is available for toning down LNO if desired, causing the compiler to suppress nearly all LNO:
% opencc -O3 -LNO:opt=0
By default, the compiler will do full LNO, equivalent to using this switch:
% opencc -O3 -LNO:opt=1
The following switches provide more fine-tuned control over loop fusion specifically. First, to suppress loop fusion entirely:
% opencc -O3 -LNO:fusion=0
Or to allow traditional loop fusion to happen:
% opencc -O3 -LNO:fusion=1
And finally to tell the compiler to perform very aggressive loop fusion:
% opencc -O3 -LNO:fusion=2
Another subset of LNO that can be fine-tuned is loop unrolling. There are several switches for managing loop unrolling optimizations within LNO. To explicitly suppress loop unrolling:
% opencc -O3 -LNO:full_unroll=0
Or to limit the number of unrolls of a loop to a maximum of N:
% opencc -O3 -LNO:full_unroll=N
Or to just specify the maximum size of the fully unrolled loop:
% opencc -O3 -LNO:full_unroll_size=N
Many other switches exist, so refer to the user guide for complete details. Familiarity with the various knobs available in the Open64 Compiler Suite will help you optimize your applications and achieve unprecedented performance gains. Open64 has the complete backing of AMD and other organizations, which means there are plenty of resources to turn to for help and collaboration when developing with Open64. Not only that, but the development of compiler enhancements is ongoing, so you can use Open64 knowing that it will continue to be supported and improved.
The Open64 Compiler Suite provides a rich set of tools for fine-tuning today's advanced applications for execution on x86 platforms. This document only provides a brief glimpse into the many advantages to be had from using Open64 as an integral part of your development efforts. Refer to the many resources available through AMD and the Open64 website for additional details and support.
Resources
Open64 Compiler Suite User Guide: Documentation on all the command line switches and usage models of Open64.
AMD’s Open64 Site: Link to Open64 resources on AMD’s website.
Open64 Website: Home of the Open64 Suite including forums, download links, and other resources.