- =====================================
- Performance Tips for Frontend Authors
- =====================================
- .. contents::
- :local:
- :depth: 2
- Abstract
- ========
- The intended audience of this document is developers of language frontends
- targeting LLVM IR. This document is home to a collection of tips on how to
- generate IR that optimizes well. As with any optimizer, LLVM has its strengths
- and weaknesses. In some cases, surprisingly small changes in the source IR
- can have a large effect on the generated code.
- Avoid loads and stores of large aggregate type
- ================================================
- LLVM currently does not optimize loads and stores of large :ref:`aggregate
- types <t_aggregate>` (i.e. structs and arrays) well. As an alternative, consider
- loading individual fields from memory.
- Aggregates that are smaller than the largest (performant) load or store
- instruction supported by the targeted hardware are well supported. These can
- be an effective way to represent collections of small packed fields.
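- As a rough illustration of the advice above, the following sketch contrasts a
- whole-aggregate load with a load of a single field. The type ``%struct.Big``
- and both function names are hypothetical, and the snippet assumes the opaque
- pointer syntax (``ptr``) used by recent LLVM releases.
- .. code-block:: llvm
-    ; Hypothetical aggregate that is much larger than any single load.
-    %struct.Big = type { i64, [64 x i32], double }
-    ; Discouraged: loads the entire aggregate just to read one field.
-    define double @get_weight_whole(ptr %p) {
-    entry:
-      %agg = load %struct.Big, ptr %p, align 8
-      %w = extractvalue %struct.Big %agg, 2
-      ret double %w
-    }
-    ; Preferred: compute the field address with a GEP and load only that field.
-    define double @get_weight_field(ptr %p) {
-    entry:
-      %w.addr = getelementptr inbounds %struct.Big, ptr %p, i32 0, i32 2
-      %w = load double, ptr %w.addr, align 8
-      ret double %w
-    }
- Both functions return the same value; the second form simply avoids asking
- the optimizer to break up the large load.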
- Prefer zext over sext when legal
- ==================================
- On some architectures (X86_64 is one), sign extension can involve an extra
- instruction whereas zero extension can be folded into a load. LLVM will try to
- replace a sext with a zext when it can be proven safe, but if you have
- information in your source language about the range of an integer value, it can
- be profitable to use a zext rather than a sext.
- Alternatively, you can :ref:`specify the range of the value using metadata
- <range-metadata>` and LLVM can do the sext to zext conversion for you.
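- The two options can be sketched as follows. The function names and the bound
- ``[0, 1024)`` are hypothetical; the sketch assumes the frontend already knows
- the value is non-negative, and it uses opaque pointer syntax (``ptr``).
- .. code-block:: llvm
-    ; Option 1: the frontend knows %len is non-negative and emits zext directly.
-    define i64 @widen_known_nonneg(i32 %len) {
-    entry:
-      %wide = zext i32 %len to i64
-      ret i64 %wide
-    }
-    ; Option 2: keep the sext but attach !range to the load, which gives the
-    ; optimizer the information it needs to convert the sext to a zext.
-    define i64 @widen_with_range(ptr %p) {
-    entry:
-      %len = load i32, ptr %p, align 4, !range !0
-      %wide = sext i32 %len to i64
-      ret i64 %wide
-    }
-    ; Half-open range [0, 1024); the bound is an illustrative assumption.
-    !0 = !{i32 0, i32 1024}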
- Zext GEP indices to machine register width
- ============================================
- Internally, LLVM often promotes the width of GEP indices to machine register
- width. When it does so, it will default to using sign extension (sext)
- operations for safety. If your source language provides information about
- the range of the index, you may wish to manually extend indices to machine
- register width using a zext instruction.
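- A minimal sketch of this pattern, assuming a 64-bit target and an index that
- the source language guarantees to be non-negative. The function name
- ``@load_elem`` is hypothetical.
- .. code-block:: llvm
-    define i32 @load_elem(ptr %base, i32 %i) {
-    entry:
-      ; Widen the index explicitly with zext instead of passing the i32 to the
-      ; GEP, where it would be treated as a signed value.
-      %idx = zext i32 %i to i64
-      %addr = getelementptr inbounds i32, ptr %base, i64 %idx
-      %v = load i32, ptr %addr, align 4
-      ret i32 %v
-    }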
- Other things to consider
- =========================
- #. Make sure that a DataLayout is provided (this will likely become required in
- the near future, but is certainly important for optimization).
- #. Add nsw/nuw flags as appropriate. Reasoning about overflow is
- generally hard for an optimizer so providing these facts from the frontend
- can be very impactful.
- #. Use fast-math flags on floating point operations if legal. If you don't
- need strict IEEE floating point semantics, there are a number of additional
- optimizations that can be performed. This can be highly impactful for
- floating point intensive computations.
- #. Use inbounds on geps. This can help to disambiguate some aliasing queries.
- #. Add noalias/align/dereferenceable/nonnull to function arguments and return
- values as appropriate.
- #. Mark functions as readnone/readonly or noreturn/nounwind when known. The
- optimizer will try to infer these flags, but may not always be able to.
- Manual annotations are particularly important for external functions that
- the optimizer cannot analyze.
- #. Use ptrtoint/inttoptr sparingly (they interfere with pointer aliasing
- analysis); prefer GEPs.
- #. Use the lifetime.start/lifetime.end and invariant.start/invariant.end
- intrinsics where possible. Common profitable uses are for stack-like data
- structures (thus allowing dead store elimination) and for describing
- lifetimes of allocas (thus allowing smaller stack sizes).
- #. Use pointer aliasing metadata, especially tbaa metadata, to communicate
- otherwise-non-deducible pointer aliasing facts.
- #. Use the "most-private" possible linkage types for the functions being defined
- (private, internal or linkonce_odr preferably).
- #. Mark invariant locations using !invariant.load and TBAA's constant flags.
- #. Prefer globals over inttoptr of a constant address; this gives you
- dereferenceability information. In MCJIT, use getSymbolAddress to provide
- the actual address.
- #. Be wary of ordered and atomic memory operations. They are hard to optimize
- and may not be well optimized by the current optimizer. Depending on your
- source language, you may consider using fences instead.
- #. If calling a function which is known to throw an exception (unwind), use
- an invoke with a normal destination which contains an unreachable
- instruction. This form conveys to the optimizer that the call returns
- abnormally. For an invoke which neither returns normally nor requires unwind
- code in the current function, you can use a noreturn call instruction if
- desired. This is generally not required because the optimizer will convert
- an invoke with an unreachable unwind destination to a call instruction.
- #. If your language uses range checks, consider using the IRCE pass. It is not
- currently part of the standard pass order.
- #. For languages with numerous rarely executed guard conditions (e.g. null
- checks, type checks, range checks) consider adding an extra execution or
- two of LoopUnswitch and LICM to your pass order. The standard pass order,
- which is tuned for C and C++ applications, may not be sufficient to remove
- all dischargeable checks from loops.
- #. Use profile metadata to indicate statically known cold paths, even if
- dynamic profiling information is not available. This can make a large
- difference in code placement and thus the performance of tight loops.
- #. When generating code for loops, try to avoid terminating the header block of
- the loop earlier than necessary. If the terminator of the loop header
- block is a loop exiting conditional branch, the effectiveness of LICM will
- be limited for loads not in the header. (This is due to the fact that LLVM
- may not know such a load is safe to speculatively execute and thus can't
- lift an otherwise loop invariant load unless it can prove the exiting
- condition is not taken.) It can be profitable, in some cases, to emit such
- instructions into the header even if they are not used along a rarely
- executed path that exits the loop. This guidance specifically does not
- apply if the condition which terminates the loop header is itself invariant,
- or can be easily discharged by inspecting the loop index variables.
- #. In hot loops, consider duplicating instructions from small basic blocks
- which end in highly predictable terminators into their successor blocks.
- If a hot successor block contains instructions which can be vectorized
- with the duplicated ones, this can provide a noticeable throughput
- improvement. Note that this is not always profitable and does involve a
- potentially large increase in code size.
- #. Avoid high in-degree basic blocks (e.g. basic blocks with dozens or hundreds
- of predecessors). Among other issues, the register allocator is known to
- perform badly when confronted with such structures. The only exception to
- this guidance is that a unified return block with high in-degree is fine.
- #. When checking a value against a constant, emit the check using a consistent
- comparison type. The GVN pass *will* optimize redundant equalities even if
- the type of comparison is inverted, but GVN only runs late in the pipeline.
- As a result, you may miss the opportunity to run other important
- optimizations. Improvements to EarlyCSE to remove this issue are tracked in
- Bug 23333.
- #. Avoid using arithmetic intrinsics unless you are *required* by your source
- language specification to emit a particular code sequence. The optimizer
- is quite good at reasoning about general control flow and arithmetic; it is
- not anywhere near as strong at reasoning about the various intrinsics. If
- profitable for code generation purposes, the optimizer will likely form the
- intrinsics itself late in the optimization pipeline. It is *very* rarely
- profitable to emit these directly in the language frontend. This item
- explicitly includes the use of the :ref:`overflow intrinsics <int_overflow>`.
- #. Avoid using the :ref:`assume intrinsic <int_assume>` until you've
- established that a) there's no other way to express the given fact and b)
- that fact is critical for optimization purposes. Assumes are a great
- prototyping mechanism, but they can have negative effects on both compile
- time and optimization effectiveness. The former is fixable with enough
- effort, but the latter is fairly fundamental to their designed purpose.
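- Several of the items in the list above can be combined in one small module.
- The following is only a sketch: the data layout string is an illustrative
- value (a real frontend would normally obtain it from its target machine), the
- function names ``@norm`` and ``@scaled_next`` are hypothetical, and every
- attribute and flag shown is a fact the frontend is assumed to know about its
- source program.
- .. code-block:: llvm
-    ; Provide a DataLayout (illustrative value, not tied to a real target here).
-    target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
-    ; External function the optimizer cannot analyze, so the frontend
-    ; annotates it by hand: it only reads memory and never unwinds.
-    declare double @norm(ptr readonly) nounwind readonly
-    ; Internal linkage, plus argument attributes the frontend can guarantee.
-    define internal double @scaled_next(ptr noalias nonnull align 8 dereferenceable(32) %v, i32 %i) nounwind {
-    entry:
-      ; nuw/nsw: the frontend knows %i + 1 cannot overflow.
-      %next = add nuw nsw i32 %i, 1
-      %idx = zext i32 %next to i64
-      ; inbounds: the resulting address stays within the underlying object.
-      %addr = getelementptr inbounds double, ptr %v, i64 %idx
-      %x = load double, ptr %addr, align 8
-      %len = call double @norm(ptr %v)
-      ; fast-math flags: strict IEEE semantics are assumed not to be required.
-      %r = fdiv fast double %x, %len
-      ret double %r
-    }
- Each annotation is a promise made by the frontend; if any of them does not
- hold for the source program, the resulting IR has undefined behavior or may
- be miscompiled, so add only the ones that are known to be true.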
- p.s. If you want to help improve this document, patches expanding any of the
- above items into standalone sections of their own with a more complete
- discussion would be very welcome.