==============================================
LLVM Atomic Instructions and Concurrency Guide
==============================================

.. contents::
   :local:

Introduction
============

Historically, LLVM has not had very strong support for concurrency; some minimal
intrinsics were provided, and ``volatile`` was used in some cases to achieve
rough semantics in the presence of concurrency. However, this is changing;
there are now new instructions which are well-defined in the presence of threads
and asynchronous signals, and the model for existing instructions has been
clarified in the IR.

The atomic instructions are designed specifically to provide readable IR and
optimized code generation for the following:

* The new C++11 ``<atomic>`` header. (`C++11 draft available here
  <http://www.open-std.org/jtc1/sc22/wg21/>`_.) (`C11 draft available here
  <http://www.open-std.org/jtc1/sc22/wg14/>`_.)

* Proper semantics for Java-style memory, for both ``volatile`` and regular
  shared variables. (`Java Specification
  <http://docs.oracle.com/javase/specs/jls/se8/html/jls-17.html>`_)

* gcc-compatible ``__sync_*`` builtins. (`Description
  <https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html>`_)

* Other scenarios with atomic semantics, including ``static`` variables with
  non-trivial constructors in C++.

Atomic and volatile in the IR are orthogonal; "volatile" is the C/C++ volatile,
which ensures that every volatile load and store happens and is performed in the
stated order. A couple of examples: if a SequentiallyConsistent store is
immediately followed by another SequentiallyConsistent store to the same
address, the first store can be erased. This transformation is not allowed for a
pair of volatile stores. On the other hand, a non-volatile non-atomic load can
be moved across a volatile load freely, but not an Acquire load.

This document is intended as a guide for anyone either writing a frontend for
LLVM or working on optimization passes for LLVM, explaining how to deal with
instructions with special semantics in the presence of concurrency. This
is not intended to be a precise guide to the semantics; the details can get
extremely complicated and unreadable, and are not usually necessary.

.. _Optimization outside atomic:

Optimization outside atomic
===========================

The basic ``'load'`` and ``'store'`` allow a variety of optimizations, but can
lead to undefined results in a concurrent environment; see `NotAtomic`_. This
section specifically goes into the one optimizer restriction which applies in
concurrent environments, which gets a bit more of an extended description
because any optimization dealing with stores needs to be aware of it.

From the optimizer's point of view, the rule is that if there are not any
instructions with atomic ordering involved, concurrency does not matter, with
one exception: if a variable might be visible to another thread or signal
handler, a store cannot be inserted along a path where it might not execute
otherwise. Take the following example:

.. code-block:: c

  /* C code, for readability; run through clang -O2 -S -emit-llvm to get
     equivalent IR */
  int x;
  void f(int* a) {
    for (int i = 0; i < 100; i++) {
      if (a[i])
        x += 1;
    }
  }

The following is equivalent in non-concurrent situations:

.. code-block:: c

  int x;
  void f(int* a) {
    int xtemp = x;
    for (int i = 0; i < 100; i++) {
      if (a[i])
        xtemp += 1;
    }
    x = xtemp;
  }

However, LLVM is not allowed to transform the former to the latter: it could
indirectly introduce undefined behavior if another thread can access ``x`` at
the same time. (This example is particularly of interest because before the
concurrency model was implemented, LLVM would perform this transformation.)

Note that speculative loads are allowed; a load which is part of a race returns
``undef``, but does not have undefined behavior.

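
One way to read the rule: the running count may still live in a local, as long
as the store to ``x`` stays on paths where the original loop would have stored.
The following is a hand-written sketch of such a variant, not a claim about what
LLVM emits; the names ``f_original``, ``f_guarded`` and ``check_same_result``
are made up for illustration.

.. code-block:: c

  #include <assert.h>

  int x;

  /* The loop from the example above: x is written only if some a[i] != 0. */
  void f_original(int *a) {
    for (int i = 0; i < 100; i++) {
      if (a[i])
        x += 1;
    }
  }

  /* Hypothetical variant: count locally, then touch x only if the original
     loop would have written it at least once. No store is introduced on the
     path where every a[i] is zero. */
  void f_guarded(int *a) {
    int hits = 0;
    for (int i = 0; i < 100; i++) {
      if (a[i])
        hits += 1;
    }
    if (hits)
      x += hits;
  }

  /* Single-threaded usage: both versions leave the same value in x. */
  void check_same_result(int *a) {
    x = 0;
    f_original(a);
    int expected = x;
    x = 0;
    f_guarded(a);
    assert(x == expected);
  }

The guard on ``hits`` is what separates this from the forbidden version, which
stores to ``x`` unconditionally.
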
Atomic instructions
===================

For cases where simple loads and stores are not sufficient, LLVM provides
various atomic instructions. The exact guarantees provided depend on the
ordering; see `Atomic orderings`_.

``load atomic`` and ``store atomic`` provide the same basic functionality as
non-atomic loads and stores, but provide additional guarantees in situations
where threads and signals are involved.

``cmpxchg`` and ``atomicrmw`` are essentially like an atomic load followed by an
atomic store (where the store is conditional for ``cmpxchg``), but no other
memory operation can happen on any thread between the load and store.

A ``fence`` provides Acquire and/or Release ordering which is not part of
another operation; it is normally used along with Monotonic memory operations.
A Monotonic load followed by an Acquire fence is roughly equivalent to an
Acquire load, and a Monotonic store following a Release fence is roughly
equivalent to a Release store. SequentiallyConsistent fences behave as both
an Acquire and a Release fence, and offer some additional complicated
guarantees; see the C++11 standard for details.

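
As a rough C11 illustration of the fence pairing described above, the sketch
below uses ``memory_order_relaxed`` operations (Monotonic in LLVM terms)
together with ``atomic_thread_fence``. The names ``publish``, ``try_consume``
and ``fence_demo`` are hypothetical, and the demo runs both sides on one thread
only to show the call pattern.

.. code-block:: c

  #include <assert.h>
  #include <stdatomic.h>

  int payload;        /* plain, non-atomic data */
  atomic_int ready;   /* flag, only accessed with relaxed operations */

  /* Writer: Release fence, then a relaxed store of the flag. This is roughly
     equivalent to a Release store of the flag. */
  void publish(int value) {
    payload = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
  }

  /* Reader: relaxed load of the flag, then an Acquire fence. This is roughly
     equivalent to an Acquire load of the flag. Returns 1 and fills *out once
     the payload has been published, 0 otherwise. */
  int try_consume(int *out) {
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
      return 0;
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return 1;
  }

  void fence_demo(void) {
    int v = 0;
    assert(!try_consume(&v));          /* nothing published yet */
    publish(42);
    assert(try_consume(&v) && v == 42);
  }
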
Frontends generating atomic instructions generally need to be aware of the
target to some degree; atomic instructions are guaranteed to be lock-free, and
therefore an instruction which is wider than the target natively supports can be
impossible to generate.

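
By analogy at the C level, a frontend author can see the same target dependence
through the C11 ``ATOMIC_*_LOCK_FREE`` macros, which report 0 (never), 1
(sometimes) or 2 (always lock-free) for a given type. The sketch below is only
an illustration of that idea; ``enum width``, ``lock_free_level`` and
``can_use_inline_atomic`` are hypothetical names, and this is not how LLVM
itself decides what it can generate.

.. code-block:: c

  #include <assert.h>
  #include <stdatomic.h>

  enum width { WIDTH_INT, WIDTH_LLONG, WIDTH_POINTER };

  /* Reports the C11 lock-free level for a few representative widths. */
  int lock_free_level(enum width w) {
    switch (w) {
    case WIDTH_INT:     return ATOMIC_INT_LOCK_FREE;
    case WIDTH_LLONG:   return ATOMIC_LLONG_LOCK_FREE;
    case WIDTH_POINTER: return ATOMIC_POINTER_LOCK_FREE;
    }
    return 0;
  }

  /* A cautious caller only relies on widths that are always lock-free and
     falls back to explicit locking otherwise. */
  int can_use_inline_atomic(enum width w) {
    return lock_free_level(w) == 2;
  }

  void width_demo(void) {
    int level = lock_free_level(WIDTH_LLONG);
    assert(level >= 0 && level <= 2);
  }
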
.. _Atomic orderings:

Atomic orderings
================

In order to achieve a balance between performance and necessary guarantees,
there are six levels of atomicity. They are listed in order of strength; each
level includes all the guarantees of the previous level except for
Acquire/Release. (See also `LangRef Ordering <LangRef.html#ordering>`_.)

.. _NotAtomic:

NotAtomic
---------

NotAtomic is the obvious case: a load or store which is not atomic. (This isn't
really a level of atomicity, but is listed here for comparison.) This is
essentially a regular load or store. If there is a race on a given memory
location, loads from that location return undef.

Relevant standard
  This is intended to match shared variables in C/C++, and to be used in any
  other context where memory access is necessary, and a race is impossible. (The
  precise definition is in `LangRef Memory Model <LangRef.html#memmodel>`_.)

Notes for frontends
  The rule is essentially that all memory accessed with basic loads and stores
  by multiple threads should be protected by a lock or other synchronization;
  otherwise, you are likely to run into undefined behavior. If your frontend is
  for a "safe" language like Java, use Unordered to load and store any shared
  variable. Note that NotAtomic volatile loads and stores are not properly
  atomic; do not try to use them as a substitute. (Per the C/C++ standards,
  volatile does provide some limited guarantees around asynchronous signals, but
  atomics are generally a better solution.)

Notes for optimizers
  Introducing loads to shared variables along a codepath where they would not
  otherwise exist is allowed; introducing stores to shared variables is not. See
  `Optimization outside atomic`_.

Notes for code generation
  The one interesting restriction here is that it is not allowed to write to
  bytes outside of the bytes relevant to a store. This is mostly relevant to
  unaligned stores: it is not allowed in general to convert an unaligned store
  into two aligned stores of the same width as the unaligned store. Backends are
  also expected to generate an i8 store as an i8 store, and not an instruction
  which writes to surrounding bytes. (If you are writing a backend for an
  architecture which cannot satisfy these restrictions and cares about
  concurrency, please send an email to llvm-dev.)

Unordered
---------

Unordered is the lowest level of atomicity. It essentially guarantees that races
produce somewhat sane results instead of having undefined behavior. It also
guarantees the operation to be lock-free, so it does not depend on the data
being part of a special atomic structure or depend on a separate per-process
global lock. Note that code generation will fail for unsupported atomic
operations; if you need such an operation, use explicit locking.

Relevant standard
  This is intended to match the Java memory model for shared variables.

Notes for frontends
  This cannot be used for synchronization, but is useful for Java and other
  "safe" languages which need to guarantee that the generated code never
  exhibits undefined behavior. Note that this guarantee is cheap on common
  platforms for loads and stores of a native width, but can be expensive or
  unavailable for wider accesses, like a 64-bit store on ARM. (A frontend for
  Java or other "safe" languages would normally split a 64-bit store on ARM into
  two 32-bit unordered stores.)

Notes for optimizers
  In terms of the optimizer, this prohibits any transformation that transforms a
  single load into multiple loads, transforms a store into multiple stores,
  narrows a store, or stores a value which would not be stored otherwise. Some
  examples of unsafe optimizations are narrowing an assignment into a bitfield,
  rematerializing a load, and turning loads and stores into a memcpy
  call. Reordering unordered operations is safe, though, and optimizers should
  take advantage of that because unordered operations are common in languages
  that need them.

Notes for code generation
  These operations are required to be atomic in the sense that if you use
  unordered loads and unordered stores, a load cannot see a value which was
  never stored. A normal load or store instruction is usually sufficient, but
  note that an unordered load or store cannot be split into multiple
  instructions (or an instruction which does multiple memory operations, like
  ``LDRD`` on ARM without LPAE, or not naturally-aligned ``LDRD`` on LPAE ARM).

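
The splitting of a 64-bit store into two 32-bit stores mentioned in the frontend
notes can be sketched in C11 as follows. C has no direct equivalent of
Unordered, so ``memory_order_relaxed`` (Monotonic, which is at least as strong)
stands in for it here; ``struct split64``, ``split_store``, ``split_load`` and
``split_demo`` are hypothetical names.

.. code-block:: c

  #include <assert.h>
  #include <stdatomic.h>
  #include <stdint.h>

  /* A 64-bit field kept as two independently atomic 32-bit halves. */
  struct split64 {
    _Atomic uint32_t lo;
    _Atomic uint32_t hi;
  };

  /* Two 32-bit atomic stores instead of one 64-bit atomic store. */
  void split_store(struct split64 *s, uint64_t v) {
    atomic_store_explicit(&s->lo, (uint32_t)v, memory_order_relaxed);
    atomic_store_explicit(&s->hi, (uint32_t)(v >> 32), memory_order_relaxed);
  }

  /* Under a race the two halves may come from different stores, but each
     half is always a value that some store actually wrote. */
  uint64_t split_load(struct split64 *s) {
    uint64_t lo = atomic_load_explicit(&s->lo, memory_order_relaxed);
    uint64_t hi = atomic_load_explicit(&s->hi, memory_order_relaxed);
    return (hi << 32) | lo;
  }

  void split_demo(void) {
    static struct split64 s;
    split_store(&s, 0x1122334455667788ULL);
    assert(split_load(&s) == 0x1122334455667788ULL);
  }
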
Monotonic
---------

Monotonic is the weakest level of atomicity that can be used in synchronization
primitives, although it does not provide any general synchronization. It
essentially guarantees that if you take all the operations affecting a specific
address, a consistent ordering exists.

Relevant standard
  This corresponds to the C++11/C11 ``memory_order_relaxed``; see those
  standards for the exact definition.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution. The
  guarantees in terms of synchronization are very weak, so make sure these are
  only used in a pattern which you know is correct. Generally, these would
  either be used for atomic operations which do not protect other memory (like
  an atomic counter), or along with a ``fence``.

Notes for optimizers
  In terms of the optimizer, this can be treated as a read+write on the relevant
  memory location (and alias analysis will take advantage of that). In addition,
  it is legal to reorder non-atomic and Unordered loads around Monotonic
  loads. CSE/DSE and a few other optimizations are allowed, but Monotonic
  operations are unlikely to be used in ways which would make those
  optimizations useful.

Notes for code generation
  Code generation is essentially the same as that for Unordered loads and
  stores. No fences are required. ``cmpxchg`` and ``atomicrmw`` are required
  to appear as a single operation.

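
The atomic counter mentioned in the frontend notes is the typical case where
Monotonic is enough, because the counter does not protect any other memory. A
minimal C11 sketch, with hypothetical names ``record_event``, ``events_seen``
and ``counter_demo``:

.. code-block:: c

  #include <assert.h>
  #include <stdatomic.h>

  atomic_ulong events;   /* zero-initialized; protects no other data */

  /* A single relaxed read-modify-write: no increment is lost, but no
     ordering is imposed on surrounding memory accesses. */
  void record_event(void) {
    atomic_fetch_add_explicit(&events, 1, memory_order_relaxed);
  }

  unsigned long events_seen(void) {
    return atomic_load_explicit(&events, memory_order_relaxed);
  }

  void counter_demo(void) {
    unsigned long before = events_seen();
    record_event();
    record_event();
    assert(events_seen() == before + 2);
  }
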
Acquire
-------

Acquire provides a barrier of the sort necessary to acquire a lock to access
other memory with normal loads and stores.

Relevant standard
  This corresponds to the C++11/C11 ``memory_order_acquire``. It should also be
  used for C++11/C11 ``memory_order_consume``.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Acquire only provides a semantic guarantee when paired with a Release
  operation.

Notes for optimizers
  Optimizers not aware of atomics can treat this like a nothrow call. It is
  also possible to move stores from before an Acquire load or read-modify-write
  operation to after it, and move non-Acquire loads from before an Acquire
  operation to after it.

Notes for code generation
  Architectures with weak memory ordering (essentially everything relevant today
  except x86 and SPARC) require some sort of fence to maintain the Acquire
  semantics. The precise fences required vary widely by architecture, but for
  a simple implementation, most architectures provide a barrier which is strong
  enough for everything (``dmb`` on ARM, ``sync`` on PowerPC, etc.). Putting
  such a fence after the equivalent Monotonic operation is sufficient to
  maintain Acquire semantics for a memory operation.

Release
-------

Release is similar to Acquire, but with a barrier of the sort necessary to
release a lock.

Relevant standard
  This corresponds to the C++11/C11 ``memory_order_release``.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Release only provides a semantic guarantee when paired with an Acquire
  operation.

Notes for optimizers
  Optimizers not aware of atomics can treat this like a nothrow call. It is
  also possible to move loads from after a Release store or read-modify-write
  operation to before it, and move non-Release stores from after a Release
  operation to before it.

Notes for code generation
  See the section on Acquire; a fence before the relevant operation is usually
  sufficient for Release. Note that a store-store fence is not sufficient to
  implement Release semantics; store-store fences are generally not exposed to
  IR because they are extremely difficult to use correctly.

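
Since Acquire and Release are described in terms of taking and releasing a lock,
a small test-and-set lock is a natural illustration of the pairing. The sketch
below uses the C11 ``atomic_flag`` operations; ``lock_try_acquire``,
``lock_release``, ``try_increment`` and ``lock_demo`` are hypothetical names,
and the demo is single-threaded.

.. code-block:: c

  #include <assert.h>
  #include <stdatomic.h>
  #include <stdbool.h>

  atomic_flag lock_word = ATOMIC_FLAG_INIT;
  int protected_value;   /* plain data, only touched while the lock is held */

  /* Acquire on the way in: accesses inside the critical section cannot be
     moved before the lock is taken. test_and_set returns the previous
     value, so false means the lock was free and is now ours. */
  bool lock_try_acquire(void) {
    return !atomic_flag_test_and_set_explicit(&lock_word, memory_order_acquire);
  }

  /* Release on the way out: accesses inside the critical section cannot be
     moved after the lock is dropped. */
  void lock_release(void) {
    atomic_flag_clear_explicit(&lock_word, memory_order_release);
  }

  bool try_increment(void) {
    if (!lock_try_acquire())
      return false;
    protected_value += 1;
    lock_release();
    return true;
  }

  void lock_demo(void) {
    assert(lock_try_acquire());
    assert(!try_increment());   /* lock is already held */
    lock_release();
    assert(try_increment());
  }
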
AcquireRelease
--------------

AcquireRelease (``acq_rel`` in IR) provides both an Acquire and a Release
barrier (for fences and operations which both read and write memory).

Relevant standard
  This corresponds to the C++11/C11 ``memory_order_acq_rel``.

Notes for frontends
  If you are writing a frontend which uses this directly, use with caution.
  Acquire only provides a semantic guarantee when paired with a Release
  operation, and vice versa.

Notes for optimizers
  In general, optimizers should treat this like a nothrow call; the possible
  optimizations are usually not interesting.

Notes for code generation
  This operation has Acquire and Release semantics; see the sections on Acquire
  and Release.

SequentiallyConsistent
----------------------

SequentiallyConsistent (``seq_cst`` in IR) provides Acquire semantics for loads
and Release semantics for stores. Additionally, it guarantees that a total
ordering exists between all SequentiallyConsistent operations.

Relevant standard
  This corresponds to the C++11/C11 ``memory_order_seq_cst``, Java volatile, and
  the gcc-compatible ``__sync_*`` builtins which do not specify otherwise.

Notes for frontends
  If a frontend is exposing atomic operations, these are much easier to reason
  about for the programmer than other kinds of operations, and using them is
  generally a practical performance tradeoff.

Notes for optimizers
  Optimizers not aware of atomics can treat this like a nothrow call. For
  SequentiallyConsistent loads and stores, the same reorderings are allowed as
  for Acquire loads and Release stores, except that SequentiallyConsistent
  operations may not be reordered with one another.

Notes for code generation
  SequentiallyConsistent loads minimally require the same barriers as Acquire
  operations and SequentiallyConsistent stores require Release
  barriers. Additionally, the code generator must enforce ordering between
  SequentiallyConsistent stores followed by SequentiallyConsistent loads. This
  is usually done by emitting either a full fence before the loads or a full
  fence after the stores; which is preferred varies by architecture.

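
The store-followed-by-load case called out in the code generation notes is the
classic "store buffering" pattern, sketched below in C11. The function and
variable names are hypothetical. When the two bodies run on different threads,
the total order over SequentiallyConsistent operations rules out both of them
returning 0; with only Release stores and Acquire loads that outcome would be
permitted. The demo calls them one after the other on a single thread, so it
shows the call pattern rather than the race.

.. code-block:: c

  #include <assert.h>
  #include <stdatomic.h>

  atomic_int flag_x, flag_y;

  void reset_flags(void) {
    atomic_store(&flag_x, 0);
    atomic_store(&flag_y, 0);
  }

  /* Thread 0: seq_cst store to x, then seq_cst load of y. */
  int thread0_body(void) {
    atomic_store_explicit(&flag_x, 1, memory_order_seq_cst);
    return atomic_load_explicit(&flag_y, memory_order_seq_cst);
  }

  /* Thread 1: seq_cst store to y, then seq_cst load of x. */
  int thread1_body(void) {
    atomic_store_explicit(&flag_y, 1, memory_order_seq_cst);
    return atomic_load_explicit(&flag_x, memory_order_seq_cst);
  }

  void store_buffering_demo(void) {
    reset_flags();
    int r0 = thread0_body();
    int r1 = thread1_body();
    assert(r0 == 1 || r1 == 1);   /* never both 0 */
  }
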
Atomics and IR optimization
===========================

Predicates for optimizer writers to query:

* ``isSimple()``: A load or store which is not volatile or atomic. This is
  what, for example, memcpyopt would check for operations it might transform.

* ``isUnordered()``: A load or store which is not volatile and at most
  Unordered. This would be checked, for example, by LICM before hoisting an
  operation.

* ``mayReadFromMemory()``/``mayWriteToMemory()``: Existing predicates, but note
  that they return true for any operation which is volatile or at least
  Monotonic.

* ``isAtLeastAcquire()``/``isAtLeastRelease()``: These are predicates on
  orderings. They can be useful for passes that are aware of atomics, for
  example to do DSE across a single atomic access, but not across a
  release-acquire pair (see MemoryDependencyAnalysis for an example of this).

* Alias analysis: Note that AA will return ModRef for anything Acquire or
  Release, and for the address accessed by any Monotonic operation.

To support optimizing around atomic operations, make sure you are using the
right predicates; everything should work if that is done. If your pass should
optimize some atomic operations (Unordered operations in particular), make sure
it doesn't replace an atomic load or store with a non-atomic operation.

Some examples of how optimizations interact with various kinds of atomic
operations:

* ``memcpyopt``: An atomic operation cannot be optimized into part of a
  memcpy/memset, including unordered loads/stores. It can pull operations
  across some atomic operations.

* LICM: Unordered loads/stores can be moved out of a loop. It just treats
  monotonic operations like a read+write to a memory location, and anything
  stricter than that like a nothrow call.

* DSE: Unordered stores can be DSE'ed like normal stores. Monotonic stores can
  be DSE'ed in some cases, but it's tricky to reason about, and not especially
  important. It is possible in some cases for DSE to operate across a stronger
  atomic operation, but it is fairly tricky. DSE delegates this reasoning to
  MemoryDependencyAnalysis (which is also used by other passes like GVN).

* Folding a load: Any atomic load from a constant global can be constant-folded,
  because it cannot be observed. Similar reasoning allows scalarrepl with
  atomic loads and stores.

Atomics and Codegen
===================

Atomic operations are represented in the SelectionDAG with ``ATOMIC_*`` opcodes.
On architectures which use barrier instructions for all atomic ordering (like
ARM), appropriate fences can be emitted by the AtomicExpand Codegen pass if
``setInsertFencesForAtomic()`` was used.

The MachineMemOperand for all atomic operations is currently marked as volatile;
this is not correct in the IR sense of volatile, but CodeGen handles anything
marked volatile very conservatively. This should get fixed at some point.

Common architectures have some way of representing at least a pointer-sized
lock-free ``cmpxchg``; such an operation can be used to implement all the other
atomic operations which can be represented in IR up to that size. Backends are
expected to implement all those operations, but not operations which cannot be
implemented in a lock-free manner. It is expected that backends will give an
error when given an operation which cannot be implemented. (The LLVM code
generator is not very helpful here at the moment, but hopefully that will
change.)

On x86, all atomic loads generate a ``MOV``. SequentiallyConsistent stores
generate an ``XCHG``, other stores generate a ``MOV``. SequentiallyConsistent
fences generate an ``MFENCE``, other fences do not cause any code to be
generated. ``cmpxchg`` uses the ``LOCK CMPXCHG`` instruction. ``atomicrmw xchg``
uses ``XCHG``, ``atomicrmw add`` and ``atomicrmw sub`` use ``XADD``, and all
other ``atomicrmw`` operations generate a loop with ``LOCK CMPXCHG``. Depending
on the users of the result, some ``atomicrmw`` operations can be translated into
operations like ``LOCK AND``, but that does not work in general.

On ARM (before v8), MIPS, and many other RISC architectures, Acquire, Release,
and SequentiallyConsistent semantics require barrier instructions for every such
operation. Loads and stores generate normal instructions. ``cmpxchg`` and
``atomicrmw`` can be represented using a loop with LL/SC-style instructions
which take some sort of exclusive lock on a cache line (``LDREX`` and ``STREX``
on ARM, etc.).

It is often easiest for backends to use AtomicExpandPass to lower some of the
atomic constructs. Here are some lowerings it can do:

* cmpxchg -> loop with load-linked/store-conditional
  by overriding ``hasLoadLinkedStoreConditional()``, ``emitLoadLinked()``,
  ``emitStoreConditional()``
* large loads/stores -> ll-sc/cmpxchg
  by overriding ``shouldExpandAtomicStoreInIR()``/``shouldExpandAtomicLoadInIR()``
* strong atomic accesses -> monotonic accesses + fences
  by using ``setInsertFencesForAtomic()`` and overriding ``emitLeadingFence()``
  and ``emitTrailingFence()``
* atomic rmw -> loop with cmpxchg or load-linked/store-conditional
  by overriding ``expandAtomicRMWInIR()``

For an example of all of these, look at the ARM backend.

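
The "atomic rmw -> loop with cmpxchg" lowering has a direct source-level
analogue, shown below as a C11 sketch for an atomic maximum. This illustrates
the shape of such a loop and is not the code AtomicExpandPass produces;
``fetch_max_cas`` and ``fetch_max_demo`` are hypothetical names.

.. code-block:: c

  #include <assert.h>
  #include <stdatomic.h>

  /* Atomically replaces *obj with max(*obj, value) and returns the value
     that was there before, using only a compare-exchange loop. */
  int fetch_max_cas(atomic_int *obj, int value) {
    int old = atomic_load_explicit(obj, memory_order_relaxed);
    int desired;
    do {
      desired = old > value ? old : value;
      /* On failure (including spurious failure of the weak form), `old` is
         refreshed with the current contents and the loop retries. */
    } while (!atomic_compare_exchange_weak_explicit(
        obj, &old, desired, memory_order_seq_cst, memory_order_relaxed));
    return old;
  }

  void fetch_max_demo(void) {
    atomic_int v;
    atomic_init(&v, 5);
    assert(fetch_max_cas(&v, 9) == 5);
    assert(atomic_load(&v) == 9);
  }

The loop always performs the exchange, even when the stored value would not
change, so the operation remains a read-modify-write in every case.
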