local-regalloc.txt 8.8 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208
  1. * Proposal for the local register allocator
  2. The local register allocator deals with allocating registers
  3. for temporaries inside a single basic block, while the global
  4. register allocator is concerned with method-wide allocation of
  5. variables.
  6. The global register allocator uses callee-saved register for it's
  7. purpouse so that there is no need to save and restore these registers
  8. at call sites.
  9. There are a number of issues the local allocator needs to deal with:
  10. *) some instructions expect operands in specific registers (for example
  11. the shl instruction on x86, or the call instruction with thiscall
  12. convention, or the equivalent call instructions on other architectures,
  13. such as the need to put output registers in %oX on sparc)
  14. *) some instructions deliver results only in specific registers (for example
  15. the div instruction on x86, or the call instructionson on almost all
  16. the architectures).
  17. *) it needs to know what registers may be clobbered by an instruction
  18. (such as in a method call)
  19. *) it should avoid excessive reloads or stores to improve performance
  20. While which specific instructions have limitations is architecture-dependent,
  21. the problem shold be solved in an arch-independent way to reduce code duplication.
  22. The register allocator will be 'driven' by the arch-dependent code, but it's
  23. implementation should be arch-independent.
  24. To improve the current local register allocator, we need to
  25. keep more state in it than the current setup that only keeps busy/free info.
  26. Possible state information is:
  27. free: the resgister is free to use and it doesn't contain useful info
  28. freeable: the register contains data loaded from a local (there is
  29. also info about _which_ local it contains) as a result from previous
  30. instructions (like, there was a store from the register to the local)
  31. moveable: it contains live data that is needed in a following instruction, but
  32. the contents may be moved to a different register
  33. busy: the register contains live data and it is placed there because
  34. the following instructions need it exactly in that register
  35. allocated: the register is used by the global allocator
  36. The local register allocator will have the following interfaces:
  37. int get_register ();
  38. Searches for a register in the free state. If it doesn't find it,
  39. searches for a freeable register. Sets the status to moveable.
  40. Looking for a 'free' register before a freeable one should allow for
  41. removing a few redundant loads (though I'm still unsure if such
  42. things should be delegated entirely to the peephole pass).
  43. int get_register_force (int reg);
  44. Returns 'reg' if it is free or freeable. If it is moveable, it moves it
  45. to another free or freeable register.
  46. Sets the status of 'reg' to busy.
  47. void set_register_freeable (int reg);
  48. Sets the status of 'reg' to freeable.
  49. void set_register_free (int reg);
  50. Sets the status of 'reg' to free.
  51. void will_clobber (int reg);
  52. Spills the register to the stack. Sets the status to freeable.
  53. After the clobbering has occurred, set the status to free.
  54. void register_unspill (int reg);
  55. Un-spills register reg and sets the status to moveable.
  56. FIXME: how is the 'local' information represented? Maybe a MonoInst* pointer.
  57. Note: the register allocator will insert instructions in the basic block
  58. during it's operation.
  59. * Examples
  60. Given the tree (on x86 the right argument to shl needs to be in ecx):
  61. store (local1, shl (local1, call (some_arg)))
  62. At the start of the basic block, the registers are set to the free state.
  63. The sequence of instructions may be:
  64. instruction register status -> [%eax %ecx %edx]
  65. start free free free
  66. eax = load local1 mov free free
  67. /* call clobbers eax, ecx, edx */
  68. spill eax free free free
  69. call mov free free
  70. /* now eax contains the right operand of the shl */
  71. mov %eax -> %ecx free busy free
  72. un-spill mov busy free
  73. shl %cl, %eax mov free free
  74. The resulting x86 code is:
  75. mov $fffc(%ebp), %eax
  76. mov %eax, $fff0(%ebp)
  77. push some_arg
  78. call func
  79. mov %eax, %ecx
  80. mov $fff0(%ebp), %eax
  81. shl %cl, %eax
  82. Note that since shl could operate directly on memory, we could have:
  83. push some_arg
  84. call func
  85. mov %eax, %ecx
  86. shl %cl, $fffc(%ebp)
  87. The above example with loading the operand in a register is just to complicate
  88. the example and show that the algorithm should be able to handle it.
  89. Let's take another example with the this-call call convention (the first argument
  90. is passed in %ecx).
  91. In this case, will_clobber() will be called only on %eax and %edx, while %ecx
  92. will be allocated with get_register_force ().
  93. Note: when a register is allocated with get_register_force(), it should be set
  94. to a different state as soon as possible.
  95. store (local1, shl (local1, this-call (local1)))
  96. instruction register status -> [%eax %ecx %edx]
  97. start free free free
  98. eax = load local1 mov free free
  99. /* force load in %ecx */
  100. ecx = load local1 mov busy free
  101. spill eax free busy free
  102. call mov free free
  103. /* now eax contains the right operand of the shl */
  104. mov %eax -> %ecx free busy free
  105. un-spill mov busy free
  106. shl %cl, %eax mov free free
  107. What happens when a register that we need to allocate with get_register_force ()
  108. contains an operand for the next instruction?
  109. instruction register status -> [%eax %ecx %edx]
  110. eax = load local0 mov free free
  111. ecx = load local1 mov mov free
  112. get_register_force (ecx) here.
  113. We have two options:
  114. mov %ecx, %edx
  115. or:
  116. spill %ecx
  117. The first option is way better (and allows the peephole pass to
  118. just load the value in %edx directly, instead of loading first to %ecx).
  119. This doesn't work, though, if the instruction clobbers the %edx register
  120. (like in a this-call). So, we first need to clobber the registers
  121. (so the state of %ecx changes to freebale and there is no issue
  122. with get_register_force ()).
  123. What if an instruction both clobbers a register and requires it as
  124. an operand? Lets' take the x86 idiv instruction as an example: it
  125. requires the dividend in edx:eax and returns the result in eax,
  126. with the modulus in edx.
  127. store (local1, div (local1, local2))
  128. instruction register status -> [%eax %ecx %edx]
  129. eax = load local0 mov free free
  130. will_clobber eax, edx free mov free
  131. force mov %ecx, %eax busy free free
  132. set %edx busy free busy
  133. idiv mov free free
  134. Note: edx is set to free after idiv, because the modulus is not needed
  135. (if it was a rem, eax would have been freed).
  136. If we load the divisor before will_clobber(), we'll have to spill
  137. eax and reload it later. If we load it just after the idiv, there is no issue.
  138. In any case, the algorithm should give the correct results and allow the operation.
  139. Working recursively on the isntructions there shouldn't be huge issues
  140. with this algorithm (though, of course, it's not optimal and it may
  141. introduce excessive spills or register moves). The advantage over the current
  142. local reg allocator is that:
  143. 1) the number of spills/moves would be smaller anyway
  144. 2) a separate peephole pass could be able to eliminate reg moves
  145. 3) we'll be able to remove the 'forced' spills we currently do with
  146. the return value of method calls
  147. * Issues
  148. How to best integrate such a reg allocator with the burg stuff.
  149. Think about a call os sparc with two arguments: they got into %o0 and %o1
  150. and each of them sets the register as busy. But what if the values to put there
  151. are themselves the result of a call? %o0 is no problem, but for all the
  152. next argument n the above algorithm would spill all the 0...n-1 registers...
  153. * Papers
  154. More complex solutions to the local register allocator problem:
  155. http://dimacs.rutgers.edu/TechnicalReports/abstracts/1997/97-33.html
  156. Combining register allocation and instruction scheduling:
  157. http://citeseer.nj.nec.com/motwani95combining.html
  158. More on LRA euristics:
  159. http://citeseer.nj.nec.com/liberatore97hardness.html
  160. Linear-time optimal code scheduling for delayedload architectures
  161. http://www.cs.wisc.edu/~fischer/cs701.f01/inst.sched.ps.gz
  162. Precise Register Allocation for Irregular Architectures
  163. http://citeseer.nj.nec.com/kong98precise.html
  164. Allocate registers first to subtrees that need more of them.
  165. http://www.upb.de/cs/ag-kastens/compii/folien/comment401-409.2.pdf