Flang - the Fortran Compiler
The information contained herein is for informational purposes only, and is subject to change without notice. While every precaution has been taken in the preparation of this document, it may contain technical inaccuracies, omissions and typographical errors, and AMD is under no obligation to update or otherwise correct this information. Advanced Micro Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this document, and assumes no liability of any kind, including the implied warranties of noninfringement, merchantability or fitness for particular purposes, with respect to the operation or use of AMD hardware, software or other products described herein. No license, including implied or arising by estoppel, to any intellectual property rights is granted by this document. Terms and limitations applicable to the purchase or use of AMD’s products are as set forth in a signed agreement between the parties or in AMD’s Standard Terms and Conditions of Sale. Any unauthorized copying, alteration, distribution, transmission, performance, display or other use of this material is prohibited.

**Trademarks**

AMD, the AMD Arrow logo, AMD-V, AMD Virtualization, and combinations thereof, are trademarks of Advanced Micro Devices, Inc.

Windows is a registered trademark of Microsoft Corporation.

Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

**Document Information**

Document Version: 1.0
Last Updated on: 12th March 2021

Contents

1 Synopsis
  1.1 Description
    1.1.1 IEEE-754 Support
    1.1.2 Code Generation and Optimization
  1.2 Options
    1.2.1 Target Selection
    1.2.2 Code Generation
    1.2.3 Deprecated options
    1.2.4 Driver

1 Synopsis

flang [options] filename ...

1.1 Description

Flang is a Fortran front end designed to integrate with LLVM and to interoperate with Clang/LLVM. Flang consists of the following two components:

- flang1 is invoked by the front-end driver and is responsible for transforming the Fortran program into tokens; the parser then transforms these tokens into an Abstract Syntax Tree (AST). This AST is transformed into a canonical form, which is used to generate ILM code.
- flang2 takes this ILM code and transforms it into ILI, which is then optimized by the internal optimizer. The optimized ILI is transformed into LLVM IR. The front-end driver then passes this LLVM IR to the LLVM optimizer for optimization and target code generation.

Note: AOCC’s flang extends the GitHub version with enhancements and stability improvements.

1.1.1 IEEE-754 Support

The Flang compiler does not conform to the IEEE-754 specifications when the -Ofast or -ffast-math option is specified. In these modes, the compiler enables a range of optimizations that provide faster mathematical operations.
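
One reason such modes can depart from IEEE-754 results is that fast-math style optimizations typically permit regrouping of floating-point operations, and floating-point addition is not associative. The following sketch uses Python only as an illustration (it is not Flang output, and the specific regrouping is an assumption about what an optimizer might do):

```python
# Floating-point addition is not associative, so an optimizer that is
# allowed to regroup operations can change the computed result.
left_to_right = (0.1 + 0.2) + 0.3   # grouping as written in the source
regrouped = 0.1 + (0.2 + 0.3)       # a grouping an optimizer might choose (assumption)

# The two groupings are not bit-identical.
print(left_to_right == regrouped)
print(left_to_right, regrouped)
```

Programs that depend on bit-reproducible results are therefore usually built without these options.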

1.1.2 Code Generation and Optimization

Flang relies on the AOCC optimizer and code generator to transform the available LLVM IR and generate the best code for the target x86 platform.

1.2 Options

For a list of compiler options, use the following commands:

- $ flang -help
- $ flang -help-hidden

The Flang compiler supports all the Clang compiler options and the following Flang-specific compiler options:

<table>
<thead>
<tr>
<th>Option</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>-Kieee</td>
<td>Instructs the compiler to conform to the IEEE-754 specifications. The compiler performs floating-point operations in strict conformance with the IEEE-754 standard, and some optimizations are disabled when this option is specified. Enabled by default from AOCC 2.2.0.</td>
</tr>
<tr>
<td>-Menable-vectorize-pragmas=&lt;value&gt;</td>
<td>Honors the vectorization pragmas specified in the Fortran programs. The vectorization pragmas vector, novector, and ivdep are supported in this release.</td>
</tr>
<tr>
<td>-no-flang-libs</td>
<td>Do not link against Flang libraries.</td>
</tr>
<tr>
<td>-mp</td>
<td>Enable OpenMP and link with OpenMP library libomp.</td>
</tr>
<tr>
<td>-nomp</td>
<td>Do not link with OpenMP library <code>libomp</code>.</td>
</tr>
<tr>
<td>-Mbackslash</td>
<td>Treat the backslash character as a C-style escape character.</td>
</tr>
<tr>
<td>-Mno-backslash</td>
<td>Treat backslash like any other character.</td>
</tr>
<tr>
<td>-Mbyteswapio</td>
<td>Swap byte-order for unformatted input/output.</td>
</tr>
<tr>
<td>-Mfixed</td>
<td>Assume fixed-format source.</td>
</tr>
<tr>
<td>-Mextend</td>
<td>Allow source lines up to 132 characters.</td>
</tr>
<tr>
<td>-Mfreeform</td>
<td>Assume free-format source.</td>
</tr>
<tr>
<td>-Mpreprocess</td>
<td>Run preprocessor for Fortran files.</td>
</tr>
<tr>
<td>-Mstandard</td>
<td>Check standard conformance.</td>
</tr>
<tr>
<td>-Msave</td>
<td>Assume all variables have SAVE attribute.</td>
</tr>
<tr>
<td>-module</td>
<td>Path to module file (-I also works).</td>
</tr>
<tr>
<td>-Mallocatable=95</td>
<td>Select Fortran 95 semantics for assignments to allocatable objects (default).</td>
</tr>
<tr>
<td>-Mallocatable=03</td>
<td>Select Fortran 03 semantics for assignments to allocatable objects.</td>
</tr>
<tr>
<td>-static-flang-libs</td>
<td>Link using static Flang libraries.</td>
</tr>
<tr>
<td>-M[n]daz</td>
<td>Treat denormalized numbers as zero.</td>
</tr>
<tr>
<td>-M[n]flushz</td>
<td>Set SSE to flush-to-zero mode.</td>
</tr>
<tr>
<td>-Mcache_align</td>
<td>Align large objects on cache-line boundaries.</td>
</tr>
<tr>
<td>-M[n]fprelaxed</td>
<td>This option is ignored.</td>
</tr>
<tr>
<td>-fdefault-integer-8</td>
<td>Treat INTEGER and LOGICAL as INTEGER*8 and LOGICAL*8.</td>
</tr>
<tr>
<td>-fdefault-real-8</td>
<td>Treat REAL as REAL*8.</td>
</tr>
<tr>
<td>-i8</td>
<td>Treat INTEGER and LOGICAL as INTEGER*8 and LOGICAL*8.</td>
</tr>
<tr>
<td>-r8</td>
<td>Treat REAL as REAL*8.</td>
</tr>
<tr>
<td>-fno-fortran-main</td>
<td>Do not link in Fortran main.</td>
</tr>
<tr>
<td>-Mrecursive</td>
<td>Allocate local variables on the stack, thus allowing recursion. SAVEd, data-initialized, or namelist members are always allocated statically, regardless of the setting of this switch.</td>
</tr>
</tbody>
</table>
1.2.1 Target Selection

<table>
<thead>
<tr>
<th>Option</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>-march=&lt;cpu&gt;</td>
<td>Specifies that Flang must generate code for a specific processor family member and later. For example, if you specify -march=i486, the compiler can generate instructions that are valid on i486 and later processors, but which may not exist on earlier ones.</td>
</tr>
<tr>
<td>-march=znver1</td>
<td>Use this architecture flag to enable the best code generation and tuning for the AMD Zen based x86 architecture. All x86 Zen ISA and associated intrinsics are supported.</td>
</tr>
<tr>
<td>-march=znver2</td>
<td>Use this architecture flag to enable the best code generation and tuning for the AMD Zen2 based x86 architecture. All x86 Zen2 ISA and associated intrinsics are supported.</td>
</tr>
<tr>
<td>-march=znver3</td>
<td>Use this architecture flag to enable the best code generation and tuning for the AMD Zen3 based x86 architecture. All x86 Zen3 ISA and associated intrinsics are supported.</td>
</tr>
</tbody>
</table>

1.2.2 Code Generation

Use the following options to specify the optimization level:

<table>
<thead>
<tr>
<th>Level</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>-O0</td>
<td>No optimization. This level compiles the fastest and generates the most debuggable code.</td>
</tr>
<tr>
<td>-O1</td>
<td>An optimization level between -O0 and -O2.</td>
</tr>
<tr>
<td>-O2</td>
<td>A moderate level of optimization, which enables most optimizations.</td>
</tr>
<tr>
<td>-O3</td>
<td>Similar to -O2, except that it also enables optimizations that take longer to perform or may generate larger code (in an attempt to make the program run faster). The -O3 level in AOCC includes more optimizations than the base LLVM version on which it is based. These include improved handling of indirect calls, advanced vectorization, and so on.</td>
</tr>
<tr>
<td>-Ofast</td>
<td>Enables all the optimizations from -O3 along with other aggressive optimizations that may violate strict compliance with language standards. The -Ofast level in AOCC includes more optimizations than the base LLVM version on which it is based. These include partial unswitching, improvements to inlining, unrolling, and so on.</td>
</tr>
<tr>
<td>-Os</td>
<td>Similar to -O2, but with extra optimizations to reduce the code size.</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th>Level</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>-Oz</td>
<td>Similar to -Os (and thus -O2), but reduces the code size further.</td>
</tr>
<tr>
<td>-O</td>
<td>Equivalent to -O2.</td>
</tr>
<tr>
<td>-O4 and higher</td>
<td>Equivalent to -O3.</td>
</tr>
</tbody>
</table>

For more information on these options, refer to the [LLVM documentation](https://llvm.org/docs/LlvmOpenC Compiler.html).

The following optimizations are not present in LLVM and are specific to AOCC:

- **-fstruct-layout=[1,2,3,4,5,6,7]**
  - Analyzes the whole program to determine whether the structures in the code can be peeled and whether the pointer or integer fields in the structure can be compressed. If feasible, this optimization transforms the code to enable these improvements. This transformation is likely to improve cache utilization and memory bandwidth, and is expected to improve the scalability of programs executed on multiple cores.
  - This is effective only under `-flto`, as whole-program analysis is required to perform this optimization. You can choose the level of aggressiveness with which this optimization is applied to your application, with 1 being the least aggressive and 7 the most aggressive level.
    - `-fstruct-layout=1` enables structure peeling.
    - `-fstruct-layout=2` enables structure peeling and selectively compresses self-referential pointers in these structures to 32-bit pointers, wherever safe.
    - `-fstruct-layout=3` enables structure peeling and selectively compresses self-referential pointers in these structures to 16-bit pointers, wherever safe.
    - `-fstruct-layout=4` enables structure peeling, pointer compression as in level 2, and compression of structure fields of integer type. This is performed under a strict safety check.
    - `-fstruct-layout=5` enables structure peeling, pointer compression as in level 3, and compression of structure fields of integer type. This is performed under a strict safety check.
    - `-fstruct-layout=6` enables structure peeling, pointer compression as in level 2, and compression of structure fields of type 64-bit **signed int** or **unsigned int**. You must ensure that the values assigned to 64-bit **signed int** fields are in the range -(2^31 - 1) to +(2^31 - 1) and that the values assigned to 64-bit **unsigned int** fields are in the range 0 to +(2^31 - 1). Otherwise, incorrect results may be obtained. This compression is performed without any safety analysis, so you must ensure safety based on the program being compiled.
    - `-fstruct-layout=7` enables structure peeling, pointer compression as in level 3, and compression of structure fields of type 64-bit **signed int** or **unsigned int**. You must ensure that the values assigned to 64-bit **signed int** fields are in the range -(2^31 - 1) to +(2^31 - 1) and that the values assigned to 64-bit **unsigned int** fields are in the range 0 to +(2^31 - 1). Otherwise, incorrect results may be obtained. This compression is performed without any safety analysis, so you must ensure safety based on the program being compiled.

Note:

`-fstruct-layout=4` and `-fstruct-layout=5` are derived from `-fstruct-layout=2` and `-fstruct-layout=3` respectively, with the added feature of safe compression of integer fields in structures. Going from `-fstruct-layout=4` to `-fstruct-layout=5` may result in higher performance if the pointer values are such that the pointers can be compressed to 16 bits.

`-fstruct-layout=6` and `-fstruct-layout=7` are derived from `-fstruct-layout=2` and `-fstruct-layout=3` respectively, with the added feature of compression of the integer fields in structures. These are similar to `-fstruct-layout=4` and `-fstruct-layout=5`, but here the integer fields of the structures are always compressed from 64 bits to 32 bits, without any safety guarantee.
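
Because levels 6 and 7 compress 64-bit integer fields without a safety analysis, it can help to check the value ranges yourself. The following Python sketch is only an illustration: the helper names are hypothetical, and modeling the compression as "keep the low 32 bits" is an assumption, not a description of what the compiler emits. Only the range bounds come from the text above.

```python
INT32_LIMIT = 2**31 - 1  # bound quoted in the text for levels 6 and 7

def safe_for_level_6_or_7(value, signed=True):
    """Return True if a 64-bit field value stays within the documented range."""
    low = -INT32_LIMIT if signed else 0
    return low <= value <= INT32_LIMIT

def truncate_to_int32(value):
    """Model (as an assumption) narrowing by keeping only the low 32 bits."""
    low_bits = (value & 0xFFFFFFFF).to_bytes(4, "little")
    return int.from_bytes(low_bits, "little", signed=True)

in_range = 2**31 - 1
out_of_range = 2**31

# An in-range value survives narrowing; an out-of-range value does not.
print(safe_for_level_6_or_7(in_range), truncate_to_int32(in_range) == in_range)
print(safe_for_level_6_or_7(out_of_range), truncate_to_int32(out_of_range) == out_of_range)
```

If any field can hold a value outside these bounds, levels 4 or 5, which apply the compression under a strict safety check, are the more conservative choice.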

- `-fitodcalls`
  Promotes indirect calls to direct calls by placing conditional calls. Applications or benchmarks that have a small and deterministic set of target functions for function pointers passed as call parameters benefit from this optimization. Indirect-to-direct call promotion transforms the code to use all possible determined targets under runtime checks and falls back to the original code for all other cases. The compiler introduces a runtime check for each of these possible function pointer targets, followed by a direct call to the target. This is a link-time optimization, which is invoked as `-flto -fitodcalls`.

- `-fitodcallsbyclone`
  Performs value specialization for functions with function pointers passed as an argument. It does this specialization by generating a clone of the function. The cloning of the function happens in the call chain as needed to allow conversion of an indirect function call to a direct call. This complements the `-fitodcalls` optimization and is also a link-time optimization, which is invoked as `-flto -fitodcallsbyclone`.

- `-fremap-arrays`
  Transforms the data layout of a single-dimensional array to provide better cache locality. This optimization is effective only under `-flto`, as whole-program analysis is required to perform it. It can be invoked as `-flto -fremap-arrays`.

- `-finline-aggressive`
  Enables improved inlining capability through better heuristics. This optimization is more effective when used with `-flto`, as whole-program analysis is required to perform it. It can be invoked as `-flto -finline-aggressive`.
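
The indirect-to-direct call promotion described for `-fitodcalls` can be pictured with the following Python sketch. It is an illustration of the shape of the transformation only, not compiler output; all function names are hypothetical.

```python
def square(x):
    return x * x

def negate(x):
    return -x

def apply_indirect(fn, x):
    # Original form: the call target is only known through the pointer `fn`.
    return fn(x)

def apply_promoted(fn, x):
    # Promoted form: runtime checks against the known targets, each followed
    # by a direct call; any other target falls back to the original call.
    if fn is square:
        return square(x)
    if fn is negate:
        return negate(x)
    return fn(x)

# Both forms agree for known targets and for an unknown one (abs).
for target in (square, negate, abs):
    assert apply_promoted(target, -3) == apply_indirect(target, -3)
```

The direct calls in the promoted form are what give later passes, such as inlining, something concrete to work with.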

The following optimization options must be invoked through the driver option `-mllvm <options>`, as follows:

- `-enable-partial-unswitch`
  Enables partial loop unswitching, which is an enhancement to the existing loop unswitching optimization in LLVM. Partial loop unswitching hoists a condition inside a loop from a path for which the execution condition remains invariant, whereas the original loop unswitching works only for a condition that is completely loop invariant. The condition inside the loop is hoisted out of the invariant path, and the original loop is retained for the path where the condition is variant.

- `-aggressive-loop-unswitch`
  Experimental option that enables an aggressive loop unswitching heuristic (including `-enable-partial-unswitch`) based on the usage of the branch conditional values. Loop unswitching leads to code bloat, which can be minimized if the hoisted condition is executed more often. This heuristic prioritizes the conditions based on the number of times they are used within the loop. The heuristic can be controlled with the following options:

- `-unswitch-identical-branches-min-count=<n>`
  Enables unswitching of a loop with respect to a branch conditional value (B), where B appears in at least `<n>` compares in the loop. This option is enabled with `-aggressive-loop-unswitch`. The default value is 3.

**Usage:** `-mllvm -aggressive-loop-unswitch -mllvm -unswitch-identical-branches-min-count=<n>`
where `n` is a positive integer; a lower value of `<n>` facilitates more unswitching.

- `-unswitch-identical-branches-max-count=<n>`
  Enables unswitching of a loop with respect to a branch conditional value (B), where B appears in at most `<n>` compares in the loop. This option is enabled with `-aggressive-loop-unswitch`. The default value is 6.

**Usage:** `-mllvm -aggressive-loop-unswitch -mllvm -unswitch-identical-branches-max-count=<n>`
where `n` is a positive integer; a higher value of `<n>` facilitates more unswitching.
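
As a rough picture of what partial loop unswitching (`-enable-partial-unswitch`) does, consider the Python sketch below. It is a hand-written illustration under the assumption that the condition can only change on one of the two paths; the function and key names are hypothetical and this is not compiler output.

```python
def original(data, state):
    total = 0
    for x in data:
        if state["ready"]:          # once true, nothing on this path changes it
            total += x
        else:
            state["ready"] = x > 0  # only this path can change the condition
    return total

def partially_unswitched(data, state):
    if state["ready"]:
        # Invariant path: the condition cannot change here, so it is hoisted
        # out of the loop and the loop body runs without the check.
        total = 0
        for x in data:
            total += x
        return total
    # Variant path: the original loop is retained.
    return original(data, state)

for start in (True, False):
    values = [-1, 2, 3, 4]
    assert original(values, {"ready": start}) == partially_unswitched(values, {"ready": start})
```

The condition is not fully loop invariant, so classic unswitching would not apply; the specialized loop exists only for the path on which the condition is known to stay fixed.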

- `-enable-strided-vectorization`
  Enables strided memory vectorization as an enhancement to the interleaved vectorization framework present in LLVM. It enables the effective use of gather- and scatter-like instruction patterns. This flag must be used along with the interleave vectorization flag.

- `-enable-epilog-vectorization`
  Enables vectorization of epilog iterations as an enhancement to the existing vectorization framework. This enables generation of an additional epilog vector loop version for the remainder iterations of the original vector loop. The vector size or factor of the original loop should be large enough to allow effective epilog vectorization of the remaining iterations. This optimization takes place only when the original loop is vectorized with a vector width or factor of sixteen. This width of sixteen may be overridden with the `-min-width-epilog-vectorization` command-line option.
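
To see how the remainder iterations are divided under epilog vectorization, the following Python sketch partitions a trip count among the main vector loop, an epilog vector loop, and a scalar remainder. The main width of sixteen comes from the text above; the epilog width of four and the function name are assumptions chosen for illustration.

```python
def split_iterations(trip_count, main_width=16, epilog_width=4):
    """Return (main_vector_iters, epilog_vector_iters, scalar_iters)."""
    main_iters = trip_count // main_width        # full-width vector iterations
    remainder = trip_count % main_width          # elements left for the epilog
    epilog_iters = remainder // epilog_width     # narrower vector iterations
    scalar_iters = remainder % epilog_width      # elements handled one by one
    return main_iters, epilog_iters, scalar_iters

main_iters, epilog_iters, scalar_iters = split_iterations(110)
print(main_iters, epilog_iters, scalar_iters)

# Every element is processed exactly once.
assert main_iters * 16 + epilog_iters * 4 + scalar_iters == 110
```

Without the epilog vector loop, all fourteen leftover elements in this example would run through the scalar loop.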

- `-enable-redundant-movs`
  Removes any redundant mov operations including redundant loads from memory and stores to memory. This can be invoked using `-Wl,-plugin-opt=-enable-redundant-movs`.

- `-merge-constant`
  Attempts to promote frequently occurring constants to registers. The aim is to reduce the size of the instruction encoding for instructions using constants and obtain a performance improvement.

- `-function-specialize`
  Optimizes functions with compile-time constant formal arguments.

- `-lv-function-specialization`
  Generates specialized function versions when the loops inside a function are vectorizable and the arguments are not aliased with each other.

- `-enable-vectorize-compares`
  Enables vectorization of certain loops with conditional breaks, assuming the memory accesses are safely bounded within the page boundary.

- **-inline-recursion=[1,2,3,4]**  
  Enables inlining for recursive functions based on heuristics, with level 4 being the most aggressive. The default level is 2. Higher levels may lead to code bloat due to the expansion of recursive functions at call sites.
  - Levels 1 and 2: Enable inlining for recursive functions using heuristics with inline depth 1. Level 2 uses more aggressive heuristics.
  - Level 3: Enables inlining for all recursive functions with inline depth 1.
  - Level 4: Enables inlining for all recursive functions with inline depth 10.
  This is more effective with `-flto`, as whole-program analysis is required to perform this optimization, which can be invoked as `-flto -inline-recursion=[1,2,3,4]`.

- **-reduce-array-computations=[1,2,3]**  
  Performs array dataflow analysis and optimizes the unused array computations.
  - `-reduce-array-computations=1`: Eliminates the computations on unused array elements.
  - `-reduce-array-computations=2`: Eliminates the computations on zero-valued array elements.
  - `-reduce-array-computations=3`: Eliminates the computations on unused and zero-valued array elements (combination of 1 and 2).
  This optimization is effective with `-flto`, as whole-program analysis is required to perform it. It can be invoked as `-flto -reduce-array-computations=[1,2,3]`.

- **-global-vectorize-slp={true,false}**  
  Vectorizes the straight-line code inside a basic block with data reordering vector operations. This option is set to `true` by default.

- **-region-vectorize**  
  Experimental flag for enabling vectorization of certain loops with complex control flow that the normal vectorizer cannot handle.
  This optimization is effective with `-flto`, as whole-program analysis is required to perform it. It can be invoked as `-flto -region-vectorize`.

- **-enable-x86-prefetching**  
  Enables the generation of x86 prefetch instructions for memory references inside a loop or inside the innermost loop of a loop nest, in order to prefetch the second dimension of multidimensional array or memory references in the innermost loop of a loop nest. This is an experimental pass; its profitability is being improved.

- **-suppress-fmas**  
  Identifies the reduction patterns on FMA and suppresses the FMA generation as it is not profitable on the reduction patterns.

- **-enable-licm-vrp**  
  Enables estimation of the virtual register pressure before performing loop invariant code motion. This estimation is used to control the number of loop invariants that will be hoisted during the loop invariant code motion.

- **-loop-splitting**  
  Enables splitting of loops into multiple loops to eliminate branches that compare the loop induction variable with an invariant or constant expression. This option is enabled under `-O3` by default. To disable this optimization, use `-loop-splitting=false`.

- **-enable-ipo-loop-split**
  Enables splitting of loops into multiple loops to eliminate branches that compare the loop induction variable with a constant expression. This constant expression can be derived through inter-procedural analysis. This option is enabled under `-O3` by default. To disable this optimization, use `-enable-ipo-loop-split=false`.
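
The effect of loop splitting can be sketched in Python as follows. This is an illustrative, hand-written equivalent rather than compiler output, and the function names and loop bodies are hypothetical.

```python
def original(values, k):
    out = []
    for i, v in enumerate(values):
        if i < k:              # branch compares the induction variable with an invariant
            out.append(v * 2)
        else:
            out.append(v + 1)
    return out

def split(values, k):
    out = []
    bound = max(0, min(k, len(values)))   # clamp the split point to the loop range
    for i in range(bound):                # first loop: i < k always holds
        out.append(values[i] * 2)
    for i in range(bound, len(values)):   # second loop: i >= k always holds
        out.append(values[i] + 1)
    return out

data = [5, 6, 7, 8]
for k in (-1, 0, 2, 10):
    assert original(data, k) == split(data, k)
```

Each of the two resulting loops is branch-free, which generally makes them easier for later passes such as the vectorizer to handle.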

- **-compute-interchange-order**
  Enables a heuristic for finding the best possible interchange order for a loop nest. This option takes effect only together with `-enable-loopinterchange`. It is set to `false` by default.

**Usage:** `-mllvm -enable-loopinterchange -mllvm -compute-interchange-order`

- **-convert-pow-exp-to-int={true,false}**
  Converts a call to the floating-point exponent version of `pow` into its integer exponent version if the floating-point exponent can be converted to an integer. This option is set to `true` by default.
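
The idea behind `-convert-pow-exp-to-int` can be illustrated with the Python sketch below. It is a simplified model: the helper names are hypothetical, and restricting the conversion to non-negative exponents is a simplification made here, not a documented property of the option.

```python
import math

def int_pow(base, exponent):
    """Power by repeated multiplication for a non-negative integer exponent."""
    result = 1.0
    for _ in range(exponent):
        result *= base
    return result

def pow_maybe_converted(base, exponent):
    # If the floating-point exponent holds an integral value, use the
    # integer-exponent version; otherwise keep the floating-point call.
    if float(exponent).is_integer() and exponent >= 0:
        return int_pow(base, int(exponent))
    return math.pow(base, exponent)

print(pow_maybe_converted(2.0, 3.0))   # integral exponent: integer version is used
print(pow_maybe_converted(2.0, 0.5))   # non-integral exponent: falls back to math.pow
```

The integer-exponent form replaces a general library call with a short sequence of multiplications, which is typically cheaper.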

- **-do-block-reordering={none,normal,aggressive}**
  Reorders the control predicates in increasing order of complexity from outer predicate to inner when it is safe. The `normal` mode reorders simple expressions while the `aggressive` mode will reorder predicates involving function calls if it can determine that they have no side-effects. This option is set to `normal` by default.

- **-fuse-tile-inner-loop**
  Enables fusion of adjacent tiled loops as a part of loop tiling transformation. This option is set to `false` by default.

- **-Hz,1,0x1**
  Helps preserve array index information for array access expressions that get linearized in the compiler front end. The preserved information is used by the compiler optimization phase to perform optimizations such as loop transformations. This option is recommended for any user of loop transformations or other optimizations that require de-linearized index expressions. It has no impact on any other aspect of AOCC’s Flang front end.

1.2.3 Deprecated options

- `-vectorize-memory-aggressively` *(from AOCC 2.2.0)*

1.2.4 Driver

- `-mllvm <options>`
  Provide `-mllvm` so that the option passes through the compiler front end and is applied in the optimizer, where the optimization is implemented.
  For example, `-mllvm -enable-strided-vectorization`.
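
When several optimizer options are combined, each one needs its own `-mllvm` prefix, as in the usage lines shown earlier. The following Python sketch shows one way a build script might assemble such a command line; the helper name and the source file name `example.f90` are hypothetical.

```python
def with_mllvm(optimizer_options):
    """Prefix every optimizer option with -mllvm so the driver forwards it."""
    args = []
    for option in optimizer_options:
        args.extend(["-mllvm", option])
    return args

command = (
    ["flang", "-O3"]
    + with_mllvm(["-aggressive-loop-unswitch", "-unswitch-identical-branches-min-count=2"])
    + ["example.f90"]  # hypothetical source file
)
print(" ".join(command))
```

Keeping the prefixing in one helper avoids the common mistake of passing an optimizer option directly to the driver, where it would not reach the optimizer.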