Skip navigation links
Tools
SDKs
Libraries
Samples & Demos
Docs
Zones
Community
Support
New Optimizations in Version 4.2.3 of the x86 Open64 Compiler Suite 
Skip Navigation LinksHome > Tools > x86 Open64 Compiler Suite > New Optimizations in Version 4.2.3 of the x86 Open64 Compiler Suite
Michael Lai, x86 Open64 Compiler Team  3/30/2010 

Introduction

The x86 Open64 Compiler from AMD is a highly optimizing code generation tool for developers who require optimal performance from their x86 and x86-64 Linux applications. The x86 Open64 Compiler provides a set of advanced, state-of-the-art optimizations ideal for multi-threaded applications running on multi-core, multi-processor platforms. The white paper for version 4.2.2 of the x86 Open64 Compiler Suite (“Optimizing x86 Applications with Open64: Using the Open64 Compiler Suite 4.2.2”) briefly introduces the Open64 Compiler, its history, optimization features, and some usage examples. In this paper, we will take a deeper dive into some of the new optimizations offered in the recent 4.2.3 release.

In addition to many stability and compatibility enhancements, the December 2009 general release of version 4.2.3 of the x86 Open64 Compiler Suite delivers yet another series [DBS1] of impressive optimizations:

  • Improved interprocedural analysis to include structure array copy optimization and array remapping optimization.
  • Improved loop optimizations: loop unrolling, loop unroll and jam, triangular loops, proactive loop interchange, loop distribution, and loop peeling.
  • Improved redundancy elimination optimizations for stores and memory initialization, better integration of reassociation and common subexpression elimination, and enhanced expression factorization.
  • Improved instruction selection and addressing code generation.
  • Improved vectorization.
  • Extended prefetching to include arrays with inductive base addresses.
  • Enhanced loop multi-versioning.
  • Improved OpenMP and auto-parallelization code generation.
  • Improved tuning of OpenMP and parallel runtime library functions.
  • Introduced aggressive optimizations to improve scalability/bandwidth utilization of multi-core processors (invoked by specifying the new "-mso" flag).

In the following sections, we will discuss some of these optimizations in detail.

Structure Optimization and Array Remapping

The “struct” data structure provides an easy way to group together different attributes associated with an object. For example, the following struct is a convenient way for a school to represent a student record:

struct {
  int id;  // unique student id
  char *name;
  char *address;
  int phone;
  int ssn;  // social security number
  int grad_year;  // year of graduation
  float GPA;  // cumulative GPA
};

While it is natural and intuitive to associate the above seven attributes to each student, such grouping may not be the most efficient way to process a student’s information in practice. One reason is that these attributes all have very different access frequencies. For example, while the attribute “ssn” is almost never modified or used, “grad_year” is almost never changed but sometimes referenced. Similarly, “name”, “address”, and “phone” are often referenced but seldom modified, and “id” is frequently used but never changed. Finally, “GPA” is both read and updated all the time. An application that spends the majority of its time updating, sorting, and ranking the entire school’s student body based only on GPA may contain the following loops, which are executed very frequently:

for (i = 0; i < n; i++) {  // n is the total number of students
  old_GPA = student[i].GPA;
  … // use old_GPA to compute new_GPA
  student[i].GPA = new_GPA;
}
for (i = 0; i < n; i++) {  // n is the total number of students
  … // code to sort based on student[i].GPA
}

Here, “student[]” is an array, with each of its elements being a struct shown above. The inefficiency is that although only the attribute “GPA” is used in the above loops, each access to “student[i].GPA” brings the entire cache line containing the attribute “GPA” – and, most likely, all the other attributes also -- into the cache. This results in poor cache utilization and presents an opportunity for improvement. When the aggressive interprocedural analysis optimization is in effect, the compiler will try to detect instances of such poor cache utilization and may perform structure layout optimizations to remedy the problem. There are many kinds of structure layout optimizations: structure splitting, structure peeling, and instance interleaving. All of them automatically re-group the attributes in the structure in such a way that accesses to them will be more cache-efficient. Each optimization has its own advantages (for example, ease of implementation, amount of increase in storage) and disadvantages (for example, introduction of additional pointers, computation overhead). The compiler will try to select the appropriate optimization for the occasion.

We pushed this idea of structure optimization a step further in version 4.2.3. What if the “structure” we want to optimize is not a struct, but a large, one-dimensional array? Recall the student example, but -- instead of a struct -- consider the following:

student[0] <- first student’s id

student[1] <- first student’s name

student[2] <- first student’s address

student[3] <- first student’s phone

student[4] <- first student’s ssn

student[5] <- first student’s grad_year

student[6] <- first student’s GPA

student[7] <- second student’s id

student[8] <- second student’s name

student[9] <- second student’s address

student[10] <- second student’s phone

student[11] <- second student’s ssn

student[12] <- second student’s grad_year

student[13] <- second student’s GPA

and so on. (There are some obvious problems here, such as forcing all these attributes to have the same data type, but we just want to use the same example to illustrate the point.) Other than the fact that a structure is now replaced by an array, all the previous observations of inefficiency and poor cache utilization still apply. In this case, even though there is no structure to be optimized, we can still perform a “conceptual” layout optimization. The idea is to group the accesses to the attribute “GPA” together by remapping the array indices:

student[6] (first student’s GPA) -> student[0]

student[13] (second student’s GPA) -> student[1]

student[20] (third student’s GPA) -> student[2]

and so on. This way, the “GPA” attribute for all the students are stored together contiguously from student[0] to student[n-1]. Needless to say, since these n locations now hold the “GPA” attributes, their original values (“id”, “name”, “address”, “phone”, “ssn”, “grad_year”) have to be remapped elsewhere. When all the array indices from 0 to n-1 are uniquely mapped to new indices (one-to-one) and no index is left unmapped (onto), such a mapping is called a permutation. The job of the array remapping optimization is to find a permutation that will solve the cache utilization problem. Unless the compiler can make sure that all the occurrences of the array elements in the program are mapped consistently, it will not perform this optimization. As with the traditional structure optimization, array remapping also requires interprocedural analysis be on.

Proactive Loop Nest Optimization

The proactive loop nest optimization framework was first introduced in version 4.2.2 of the x86 Open64 Compiler in the form of proactive loop fusion. This was augmented by the inclusion of proactive loop interchange in version 4.2.3. In this context, the adjective “proactive” is synonymous with “goal-oriented,” “obstacle-defying,” and “forward-looking.” We will illustrate the idea behind this proactive loop nest optimization framework using loop fusion as an example.

Loop fusion is a familiar loop transformation optimization to reduce loop overhead and expose the fused loop body to further optimizations. The loop nests:

for (i = 0; i < n; i++)  // loop L1
  a[i] = b[i] + c[i];  // statement S1
for (j = 0; j < n; j++)  // loop L2
  d[j] = b[j] + c[j];  // statement S2

can be fused to become:

for (k = 0; k < n; k++) {  // fused loop
  a[k] = b[k] + c[k];  // statement S1
  d[k] = b[k] + c[k];  // statement S2
}

In addition to greatly reducing loop overhead (half the number of increments, “++”, half the number of comparisons, “<”, and half the number of backward branches), the fused loop body suddenly becomes interesting. Before fusion, there was not much optimization opportunity for the separate statements S1 and S2; after fusion, S1 and S2 are now adjacent statements, and the array elements b[k] and c[k] in S1 will most likely be still in the cache when S2 is executed, improving cache utilization. Alternatively, the occurrences of “b[k] + c[k]” in S1 and S2 will definitely present to the compiler some common subexpression elimination opportunities.

But what if the loops L1 and L2 are not adjacent to each other to begin with? That is, what if there are other statements between L1 and L2? Or what if L1 and L2 are inside different conditional constructs?

This is where the proactive loop nest optimization framework comes in. If the compiler can identify, somewhere in the program, two loop nests that will greatly benefit from being fused together (“goal-oriented”), it will try its best to fuse them. If obstacles exist between the loop nests, the compiler will attempt to perform a number of aggressive optimizations to remove those obstacles (“obstacle-defying”). This may include (but is not limited to) complete and partial inlining of functions, if-merging, code duplication, and code motion. The optimizations are chosen to allow the two loop nests of interest eventually to become adjacent to each other (“forward-looking”). The proactive loop nest optimization framework is an automated framework driven by the structure of the program control flow graph; it even iterates itself to look for opportunities to perform more proactive loop nest optimizations, just in case the ones already performed happen to restructure the program in such a way that other similar opportunities become available. Of course, not all loops can be fused, just as not all obstacles can be removed legally, but -- for those that can be fused, and for those that can be removed legally -- it is good to know that the proactive loop nest optimization framework will do its best to optimize them.

Currently, the two proactive loop optimizations are loop fusion and loop interchange. We are considering adding more optimizations to the proactive arsenal in future releases.

Loop nest optimizations in the x86 Open64 Compiler are under the control of the “–LNO” flag.

Prefetching Arrays with Inductive Base Addresses

The purpose of the software “prefetch” instruction is to bring data from memory into the cache before the data is required by the processor. It is usually more effective for the data to reside in the cache, ready for the processor, than for the processor to wait for needed data to be fetched from memory. This is especially true for long, compute-intensive loops dealing with large arrays:

Original code:

for (i = 0; i < n; i++)
  sum = sum + a[8 * i];

Optimized code:

for (i = 0; i < n; i++) {
  prefetch a[8 * (i + 4)]  <-instruction generated by the compiler
  sum = sum + a[8 * i];
}

In this example, the compiler-generated prefetch instruction fetches data in the array four loop iterations ahead of its use. It is true that, for the first four iterations of the loop (i = 0, 1, 2, 3), the processor will have to wait for the data in a[0], a[8], a[16], a[24] to be loaded from memory before their values can be added to “sum”. However, from the fifth iteration on (i = 4, 5, …), it is very likely that the data in a[32], a[40], … will already be in the cache when the “add” instruction is being executed.

The subscript expression “8 * i” is called an inductive expression (inductive with respect to the loop index variable i) because it behaves very regularly (in this case, linearly, with a stride of 8) when i iterates from 0 to n-1. Note that the base address of the array, a, is constant. Arrays with constant base address whose subscripts are inductive expressions very easily lend themselves to prefetch optimization. The compiler usually has no trouble computing a good “prefetch ahead distance” and generating prefetch instructions in the loop. But what about the other way round? What if the base address is an inductive expression but the subscript is constant?

for (i = 0; i < n; i++)
  sum = sum + (pointer_to_a + 8 * i)[0];

This is just another way to present the same example to the compiler. The Open64 Compiler is able to decipher this alternate representation, so can generate appropriate prefetch instructions for it also.

The prefetch optimization in the x86 Open64 Compiler is under the control of the “–LNO:prefetch” flag.

Multi-core Scalability Optimization

In the past few years, multi-core processors have become widely available. Associated with multi-core processors is the issue of scalability. If an application takes 40 minutes to run to completion on a single-core processor, how long will it take on a quad-core processor? Scalability is the study of how much speed-up we get when we upgrade from a single-core processor to an n-core processor. In the 4.2.3 release, we have greatly enhanced the compiler’s ability to optimize for the scalability of applications running on multi-core processors by improving the OpenMP and auto-parallelization code generation support, as well as tuning the OpenMP and parallel runtime library functions.

The OpenMP and auto-parallelization features of the x86 Open64 Compiler are under the control of the “–openmp” and “–apo” flags.

Another measure of scalability is from the standpoint of throughput. If one application takes 40 minutes to run to completion on a single-core processor, how long will it take to run many (different or same) applications simultaneously on a quad-core processor? Traditionally, improving throughput scalability does not fall into the purview of an optimizing compiler. We thought about this problem and came up with some ideas to tackle it. Version 4.2.3 is the first release of the x86 Open64 Compiler in which we incorporated these ideas. A new compiler flag, “-mso” (multi-core scalability optimization), has been added to control these optimizations.

While it is true that each individual core of a multi-core processor is fairly self-contained (for example, each core has its own instruction pipeline, its own L1 and L2 caches, etc.), there are some important resources that are shared among all the cores. Consider the six-core AMD Opteron™ processor, for example: all six cores share the same 6 MB L3 cache, and all six cores share the same memory bus and memory controller. Different compiler optimizations can affect these shared resources differently. For example, aggressive prefetching may speed up an application running on a single-core processor, but -- if multiple applications are running simultaneously on a multi-core processor -- these additional prefetch instructions may serve only to saturate the memory bus and overburden the memory controller, degrading throughput performance. Other optimizations, such as structure optimization and proactive loop nest optimization, may have the opposite effect.

Another consideration is the overhead of the optimizations. As mentioned in the previous sections, some of these optimizations do incur a slight overhead on the generated code. However, when many applications (or many copies of an application) are running on a multi-core processor, the performance benefits from these optimizations (for example, improved cache utilization) most often far outweigh the overhead. This may not be the case when only one such application is running on a single-core processor (thus enjoying exclusive use of the L3 cache), when the additional overhead may offset any slight performance gains obtained from the optimizations. The bottom line is that specifying the “-mso” flag instructs the compiler to optimize for throughput scalability by considering these performance trade-offs and responding accordingly.

Summary

We have described in detail some of the new optimizations offered in version 4.2.3 of the x86 Open64 Compiler. There are many other optimizations in the compiler that are readily available to performance users for tuning their applications. More information about the x86 Open64 Compiler Suite from AMD, including a user’s guide containing descriptions of other optimizations as well as a list of all the compilation options and flags, can be found on the x86 Open64 Compiler Suite main page.

Back to top
2010 Advanced Micro Devices, Inc. AMD, the AMD Arrow logo, AMD Opteron, AMD Athlon, AMD Turion, AMD Sempron, AMD Phenom, ATI Radeon, Catalyst, AMD LIVE!, and combinations thereof, are trademarks of Advanced Micro Devices, Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Linux is a registered trademark of Linus Torvalds. Other names are for informational purposes only and may be trademarks of their respective owners.

This website may be linked to other websites which are not in the control of and are not maintained by AMD. AMD is not responsible for the content of those sites. AMD provides these links to you only as a convenience, and the inclusion of any link to such sites does not imply endorsement by AMD of those sites. AMD reserves the right to terminate any link or linking program at any time.
Now Available

AMD x86 Open64

Download the latest release.

Optimization Flags at a Glance

Get a quick guide to the optimization flags for the x86 Open64 compiler, in table format for easy reference.