| 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459 |
- ==============================================
- LLVM Atomic Instructions and Concurrency Guide
- ==============================================
- .. contents::
- :local:
- Introduction
- ============
- Historically, LLVM has not had very strong support for concurrency; some minimal
- intrinsics were provided, and ``volatile`` was used in some cases to achieve
- rough semantics in the presence of concurrency. However, this is changing;
- there are now new instructions which are well-defined in the presence of threads
- and asynchronous signals, and the model for existing instructions has been
- clarified in the IR.
- The atomic instructions are designed specifically to provide readable IR and
- optimized code generation for the following:
- * The new C++11 ``<atomic>`` header. (`C++11 draft available here
- <http://www.open-std.org/jtc1/sc22/wg21/>`_.) (`C11 draft available here
- <http://www.open-std.org/jtc1/sc22/wg14/>`_.)
- * Proper semantics for Java-style memory, for both ``volatile`` and regular
- shared variables. (`Java Specification
- <http://docs.oracle.com/javase/specs/jls/se8/html/jls-17.html>`_)
- * gcc-compatible ``__sync_*`` builtins. (`Description
- <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html>`_)
- * Other scenarios with atomic semantics, including ``static`` variables with
- non-trivial constructors in C++.
- Atomic and volatile in the IR are orthogonal; "volatile" is the C/C++ volatile,
- which ensures that every volatile load and store happens and is performed in the
- stated order. A couple examples: if a SequentiallyConsistent store is
- immediately followed by another SequentiallyConsistent store to the same
- address, the first store can be erased. This transformation is not allowed for a
- pair of volatile stores. On the other hand, a non-volatile non-atomic load can
- be moved across a volatile load freely, but not an Acquire load.
- This document is intended to provide a guide to anyone either writing a frontend
- for LLVM or working on optimization passes for LLVM with a guide for how to deal
- with instructions with special semantics in the presence of concurrency. This
- is not intended to be a precise guide to the semantics; the details can get
- extremely complicated and unreadable, and are not usually necessary.
- .. _Optimization outside atomic:
- Optimization outside atomic
- ===========================
- The basic ``'load'`` and ``'store'`` allow a variety of optimizations, but can
- lead to undefined results in a concurrent environment; see `NotAtomic`_. This
- section specifically goes into the one optimizer restriction which applies in
- concurrent environments, which gets a bit more of an extended description
- because any optimization dealing with stores needs to be aware of it.
- From the optimizer's point of view, the rule is that if there are not any
- instructions with atomic ordering involved, concurrency does not matter, with
- one exception: if a variable might be visible to another thread or signal
- handler, a store cannot be inserted along a path where it might not execute
- otherwise. Take the following example:
- .. code-block:: c
- /* C code, for readability; run through clang -O2 -S -emit-llvm to get
- equivalent IR */
- int x;
- void f(int* a) {
- for (int i = 0; i < 100; i++) {
- if (a[i])
- x += 1;
- }
- }
- The following is equivalent in non-concurrent situations:
- .. code-block:: c
- int x;
- void f(int* a) {
- int xtemp = x;
- for (int i = 0; i < 100; i++) {
- if (a[i])
- xtemp += 1;
- }
- x = xtemp;
- }
- However, LLVM is not allowed to transform the former to the latter: it could
- indirectly introduce undefined behavior if another thread can access ``x`` at
- the same time. (This example is particularly of interest because before the
- concurrency model was implemented, LLVM would perform this transformation.)
- Note that speculative loads are allowed; a load which is part of a race returns
- ``undef``, but does not have undefined behavior.
- Atomic instructions
- ===================
- For cases where simple loads and stores are not sufficient, LLVM provides
- various atomic instructions. The exact guarantees provided depend on the
- ordering; see `Atomic orderings`_.
- ``load atomic`` and ``store atomic`` provide the same basic functionality as
- non-atomic loads and stores, but provide additional guarantees in situations
- where threads and signals are involved.
- ``cmpxchg`` and ``atomicrmw`` are essentially like an atomic load followed by an
- atomic store (where the store is conditional for ``cmpxchg``), but no other
- memory operation can happen on any thread between the load and store.
- A ``fence`` provides Acquire and/or Release ordering which is not part of
- another operation; it is normally used along with Monotonic memory operations.
- A Monotonic load followed by an Acquire fence is roughly equivalent to an
- Acquire load, and a Monotonic store following a Release fence is roughly
- equivalent to a Release store. SequentiallyConsistent fences behave as both
- an Acquire and a Release fence, and offer some additional complicated
- guarantees, see the C++11 standard for details.
- Frontends generating atomic instructions generally need to be aware of the
- target to some degree; atomic instructions are guaranteed to be lock-free, and
- therefore an instruction which is wider than the target natively supports can be
- impossible to generate.
- .. _Atomic orderings:
- Atomic orderings
- ================
- In order to achieve a balance between performance and necessary guarantees,
- there are six levels of atomicity. They are listed in order of strength; each
- level includes all the guarantees of the previous level except for
- Acquire/Release. (See also `LangRef Ordering <LangRef.html#ordering>`_.)
- .. _NotAtomic:
- NotAtomic
- ---------
- NotAtomic is the obvious, a load or store which is not atomic. (This isn't
- really a level of atomicity, but is listed here for comparison.) This is
- essentially a regular load or store. If there is a race on a given memory
- location, loads from that location return undef.
- Relevant standard
- This is intended to match shared variables in C/C++, and to be used in any
- other context where memory access is necessary, and a race is impossible. (The
- precise definition is in `LangRef Memory Model <LangRef.html#memmodel>`_.)
- Notes for frontends
- The rule is essentially that all memory accessed with basic loads and stores
- by multiple threads should be protected by a lock or other synchronization;
- otherwise, you are likely to run into undefined behavior. If your frontend is
- for a "safe" language like Java, use Unordered to load and store any shared
- variable. Note that NotAtomic volatile loads and stores are not properly
- atomic; do not try to use them as a substitute. (Per the C/C++ standards,
- volatile does provide some limited guarantees around asynchronous signals, but
- atomics are generally a better solution.)
- Notes for optimizers
- Introducing loads to shared variables along a codepath where they would not
- otherwise exist is allowed; introducing stores to shared variables is not. See
- `Optimization outside atomic`_.
- Notes for code generation
- The one interesting restriction here is that it is not allowed to write to
- bytes outside of the bytes relevant to a store. This is mostly relevant to
- unaligned stores: it is not allowed in general to convert an unaligned store
- into two aligned stores of the same width as the unaligned store. Backends are
- also expected to generate an i8 store as an i8 store, and not an instruction
- which writes to surrounding bytes. (If you are writing a backend for an
- architecture which cannot satisfy these restrictions and cares about
- concurrency, please send an email to llvm-dev.)
- Unordered
- ---------
- Unordered is the lowest level of atomicity. It essentially guarantees that races
- produce somewhat sane results instead of having undefined behavior. It also
- guarantees the operation to be lock-free, so it does not depend on the data
- being part of a special atomic structure or depend on a separate per-process
- global lock. Note that code generation will fail for unsupported atomic
- operations; if you need such an operation, use explicit locking.
- Relevant standard
- This is intended to match the Java memory model for shared variables.
- Notes for frontends
- This cannot be used for synchronization, but is useful for Java and other
- "safe" languages which need to guarantee that the generated code never
- exhibits undefined behavior. Note that this guarantee is cheap on common
- platforms for loads of a native width, but can be expensive or unavailable for
- wider loads, like a 64-bit store on ARM. (A frontend for Java or other "safe"
- languages would normally split a 64-bit store on ARM into two 32-bit unordered
- stores.)
- Notes for optimizers
- In terms of the optimizer, this prohibits any transformation that transforms a
- single load into multiple loads, transforms a store into multiple stores,
- narrows a store, or stores a value which would not be stored otherwise. Some
- examples of unsafe optimizations are narrowing an assignment into a bitfield,
- rematerializing a load, and turning loads and stores into a memcpy
- call. Reordering unordered operations is safe, though, and optimizers should
- take advantage of that because unordered operations are common in languages
- that need them.
- Notes for code generation
- These operations are required to be atomic in the sense that if you use
- unordered loads and unordered stores, a load cannot see a value which was
- never stored. A normal load or store instruction is usually sufficient, but
- note that an unordered load or store cannot be split into multiple
- instructions (or an instruction which does multiple memory operations, like
- ``LDRD`` on ARM without LPAE, or not naturally-aligned ``LDRD`` on LPAE ARM).
- Monotonic
- ---------
- Monotonic is the weakest level of atomicity that can be used in synchronization
- primitives, although it does not provide any general synchronization. It
- essentially guarantees that if you take all the operations affecting a specific
- address, a consistent ordering exists.
- Relevant standard
- This corresponds to the C++11/C11 ``memory_order_relaxed``; see those
- standards for the exact definition.
- Notes for frontends
- If you are writing a frontend which uses this directly, use with caution. The
- guarantees in terms of synchronization are very weak, so make sure these are
- only used in a pattern which you know is correct. Generally, these would
- either be used for atomic operations which do not protect other memory (like
- an atomic counter), or along with a ``fence``.
- Notes for optimizers
- In terms of the optimizer, this can be treated as a read+write on the relevant
- memory location (and alias analysis will take advantage of that). In addition,
- it is legal to reorder non-atomic and Unordered loads around Monotonic
- loads. CSE/DSE and a few other optimizations are allowed, but Monotonic
- operations are unlikely to be used in ways which would make those
- optimizations useful.
- Notes for code generation
- Code generation is essentially the same as that for unordered for loads and
- stores. No fences are required. ``cmpxchg`` and ``atomicrmw`` are required
- to appear as a single operation.
- Acquire
- -------
- Acquire provides a barrier of the sort necessary to acquire a lock to access
- other memory with normal loads and stores.
- Relevant standard
- This corresponds to the C++11/C11 ``memory_order_acquire``. It should also be
- used for C++11/C11 ``memory_order_consume``.
- Notes for frontends
- If you are writing a frontend which uses this directly, use with caution.
- Acquire only provides a semantic guarantee when paired with a Release
- operation.
- Notes for optimizers
- Optimizers not aware of atomics can treat this like a nothrow call. It is
- also possible to move stores from before an Acquire load or read-modify-write
- operation to after it, and move non-Acquire loads from before an Acquire
- operation to after it.
- Notes for code generation
- Architectures with weak memory ordering (essentially everything relevant today
- except x86 and SPARC) require some sort of fence to maintain the Acquire
- semantics. The precise fences required varies widely by architecture, but for
- a simple implementation, most architectures provide a barrier which is strong
- enough for everything (``dmb`` on ARM, ``sync`` on PowerPC, etc.). Putting
- such a fence after the equivalent Monotonic operation is sufficient to
- maintain Acquire semantics for a memory operation.
- Release
- -------
- Release is similar to Acquire, but with a barrier of the sort necessary to
- release a lock.
- Relevant standard
- This corresponds to the C++11/C11 ``memory_order_release``.
- Notes for frontends
- If you are writing a frontend which uses this directly, use with caution.
- Release only provides a semantic guarantee when paired with a Acquire
- operation.
- Notes for optimizers
- Optimizers not aware of atomics can treat this like a nothrow call. It is
- also possible to move loads from after a Release store or read-modify-write
- operation to before it, and move non-Release stores from after an Release
- operation to before it.
- Notes for code generation
- See the section on Acquire; a fence before the relevant operation is usually
- sufficient for Release. Note that a store-store fence is not sufficient to
- implement Release semantics; store-store fences are generally not exposed to
- IR because they are extremely difficult to use correctly.
- AcquireRelease
- --------------
- AcquireRelease (``acq_rel`` in IR) provides both an Acquire and a Release
- barrier (for fences and operations which both read and write memory).
- Relevant standard
- This corresponds to the C++11/C11 ``memory_order_acq_rel``.
- Notes for frontends
- If you are writing a frontend which uses this directly, use with caution.
- Acquire only provides a semantic guarantee when paired with a Release
- operation, and vice versa.
- Notes for optimizers
- In general, optimizers should treat this like a nothrow call; the possible
- optimizations are usually not interesting.
- Notes for code generation
- This operation has Acquire and Release semantics; see the sections on Acquire
- and Release.
- SequentiallyConsistent
- ----------------------
- SequentiallyConsistent (``seq_cst`` in IR) provides Acquire semantics for loads
- and Release semantics for stores. Additionally, it guarantees that a total
- ordering exists between all SequentiallyConsistent operations.
- Relevant standard
- This corresponds to the C++11/C11 ``memory_order_seq_cst``, Java volatile, and
- the gcc-compatible ``__sync_*`` builtins which do not specify otherwise.
- Notes for frontends
- If a frontend is exposing atomic operations, these are much easier to reason
- about for the programmer than other kinds of operations, and using them is
- generally a practical performance tradeoff.
- Notes for optimizers
- Optimizers not aware of atomics can treat this like a nothrow call. For
- SequentiallyConsistent loads and stores, the same reorderings are allowed as
- for Acquire loads and Release stores, except that SequentiallyConsistent
- operations may not be reordered.
- Notes for code generation
- SequentiallyConsistent loads minimally require the same barriers as Acquire
- operations and SequentiallyConsistent stores require Release
- barriers. Additionally, the code generator must enforce ordering between
- SequentiallyConsistent stores followed by SequentiallyConsistent loads. This
- is usually done by emitting either a full fence before the loads or a full
- fence after the stores; which is preferred varies by architecture.
- Atomics and IR optimization
- ===========================
- Predicates for optimizer writers to query:
- * ``isSimple()``: A load or store which is not volatile or atomic. This is
- what, for example, memcpyopt would check for operations it might transform.
- * ``isUnordered()``: A load or store which is not volatile and at most
- Unordered. This would be checked, for example, by LICM before hoisting an
- operation.
- * ``mayReadFromMemory()``/``mayWriteToMemory()``: Existing predicate, but note
- that they return true for any operation which is volatile or at least
- Monotonic.
- * ``isAtLeastAcquire()``/``isAtLeastRelease()``: These are predicates on
- orderings. They can be useful for passes that are aware of atomics, for
- example to do DSE across a single atomic access, but not across a
- release-acquire pair (see MemoryDependencyAnalysis for an example of this)
- * Alias analysis: Note that AA will return ModRef for anything Acquire or
- Release, and for the address accessed by any Monotonic operation.
- To support optimizing around atomic operations, make sure you are using the
- right predicates; everything should work if that is done. If your pass should
- optimize some atomic operations (Unordered operations in particular), make sure
- it doesn't replace an atomic load or store with a non-atomic operation.
- Some examples of how optimizations interact with various kinds of atomic
- operations:
- * ``memcpyopt``: An atomic operation cannot be optimized into part of a
- memcpy/memset, including unordered loads/stores. It can pull operations
- across some atomic operations.
- * LICM: Unordered loads/stores can be moved out of a loop. It just treats
- monotonic operations like a read+write to a memory location, and anything
- stricter than that like a nothrow call.
- * DSE: Unordered stores can be DSE'ed like normal stores. Monotonic stores can
- be DSE'ed in some cases, but it's tricky to reason about, and not especially
- important. It is possible in some case for DSE to operate across a stronger
- atomic operation, but it is fairly tricky. DSE delegates this reasoning to
- MemoryDependencyAnalysis (which is also used by other passes like GVN).
- * Folding a load: Any atomic load from a constant global can be constant-folded,
- because it cannot be observed. Similar reasoning allows scalarrepl with
- atomic loads and stores.
- Atomics and Codegen
- ===================
- Atomic operations are represented in the SelectionDAG with ``ATOMIC_*`` opcodes.
- On architectures which use barrier instructions for all atomic ordering (like
- ARM), appropriate fences can be emitted by the AtomicExpand Codegen pass if
- ``setInsertFencesForAtomic()`` was used.
- The MachineMemOperand for all atomic operations is currently marked as volatile;
- this is not correct in the IR sense of volatile, but CodeGen handles anything
- marked volatile very conservatively. This should get fixed at some point.
- Common architectures have some way of representing at least a pointer-sized
- lock-free ``cmpxchg``; such an operation can be used to implement all the other
- atomic operations which can be represented in IR up to that size. Backends are
- expected to implement all those operations, but not operations which cannot be
- implemented in a lock-free manner. It is expected that backends will give an
- error when given an operation which cannot be implemented. (The LLVM code
- generator is not very helpful here at the moment, but hopefully that will
- change.)
- On x86, all atomic loads generate a ``MOV``. SequentiallyConsistent stores
- generate an ``XCHG``, other stores generate a ``MOV``. SequentiallyConsistent
- fences generate an ``MFENCE``, other fences do not cause any code to be
- generated. cmpxchg uses the ``LOCK CMPXCHG`` instruction. ``atomicrmw xchg``
- uses ``XCHG``, ``atomicrmw add`` and ``atomicrmw sub`` use ``XADD``, and all
- other ``atomicrmw`` operations generate a loop with ``LOCK CMPXCHG``. Depending
- on the users of the result, some ``atomicrmw`` operations can be translated into
- operations like ``LOCK AND``, but that does not work in general.
- On ARM (before v8), MIPS, and many other RISC architectures, Acquire, Release,
- and SequentiallyConsistent semantics require barrier instructions for every such
- operation. Loads and stores generate normal instructions. ``cmpxchg`` and
- ``atomicrmw`` can be represented using a loop with LL/SC-style instructions
- which take some sort of exclusive lock on a cache line (``LDREX`` and ``STREX``
- on ARM, etc.).
- It is often easiest for backends to use AtomicExpandPass to lower some of the
- atomic constructs. Here are some lowerings it can do:
- * cmpxchg -> loop with load-linked/store-conditional
- by overriding ``hasLoadLinkedStoreConditional()``, ``emitLoadLinked()``,
- ``emitStoreConditional()``
- * large loads/stores -> ll-sc/cmpxchg
- by overriding ``shouldExpandAtomicStoreInIR()``/``shouldExpandAtomicLoadInIR()``
- * strong atomic accesses -> monotonic accesses + fences
- by using ``setInsertFencesForAtomic()`` and overriding ``emitLeadingFence()``
- and ``emitTrailingFence()``
- * atomic rmw -> loop with cmpxchg or load-linked/store-conditional
- by overriding ``expandAtomicRMWInIR()``
- For an example of all of these, look at the ARM backend.
|