| 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208 |
- * Proposal for the local register allocator
- The local register allocator deals with allocating registers
- for temporaries inside a single basic block, while the global
- register allocator is concerned with method-wide allocation of
- variables.
- The global register allocator uses callee-saved register for it's
- purpouse so that there is no need to save and restore these registers
- at call sites.
- There are a number of issues the local allocator needs to deal with:
- *) some instructions expect operands in specific registers (for example
- the shl instruction on x86, or the call instruction with thiscall
- convention, or the equivalent call instructions on other architectures,
- such as the need to put output registers in %oX on sparc)
- *) some instructions deliver results only in specific registers (for example
- the div instruction on x86, or the call instructionson on almost all
- the architectures).
- *) it needs to know what registers may be clobbered by an instruction
- (such as in a method call)
- *) it should avoid excessive reloads or stores to improve performance
-
- While which specific instructions have limitations is architecture-dependent,
- the problem shold be solved in an arch-independent way to reduce code duplication.
- The register allocator will be 'driven' by the arch-dependent code, but it's
- implementation should be arch-independent.
- To improve the current local register allocator, we need to
- keep more state in it than the current setup that only keeps busy/free info.
- Possible state information is:
- free: the resgister is free to use and it doesn't contain useful info
- freeable: the register contains data loaded from a local (there is
- also info about _which_ local it contains) as a result from previous
- instructions (like, there was a store from the register to the local)
- moveable: it contains live data that is needed in a following instruction, but
- the contents may be moved to a different register
- busy: the register contains live data and it is placed there because
- the following instructions need it exactly in that register
- allocated: the register is used by the global allocator
- The local register allocator will have the following interfaces:
- int get_register ();
- Searches for a register in the free state. If it doesn't find it,
- searches for a freeable register. Sets the status to moveable.
- Looking for a 'free' register before a freeable one should allow for
- removing a few redundant loads (though I'm still unsure if such
- things should be delegated entirely to the peephole pass).
-
- int get_register_force (int reg);
- Returns 'reg' if it is free or freeable. If it is moveable, it moves it
- to another free or freeable register.
- Sets the status of 'reg' to busy.
-
- void set_register_freeable (int reg);
- Sets the status of 'reg' to freeable.
-
- void set_register_free (int reg);
- Sets the status of 'reg' to free.
- void will_clobber (int reg);
- Spills the register to the stack. Sets the status to freeable.
- After the clobbering has occurred, set the status to free.
- void register_unspill (int reg);
- Un-spills register reg and sets the status to moveable.
- FIXME: how is the 'local' information represented? Maybe a MonoInst* pointer.
- Note: the register allocator will insert instructions in the basic block
- during it's operation.
- * Examples
- Given the tree (on x86 the right argument to shl needs to be in ecx):
- store (local1, shl (local1, call (some_arg)))
- At the start of the basic block, the registers are set to the free state.
- The sequence of instructions may be:
- instruction register status -> [%eax %ecx %edx]
- start free free free
- eax = load local1 mov free free
- /* call clobbers eax, ecx, edx */
- spill eax free free free
- call mov free free
- /* now eax contains the right operand of the shl */
- mov %eax -> %ecx free busy free
- un-spill mov busy free
- shl %cl, %eax mov free free
-
- The resulting x86 code is:
- mov $fffc(%ebp), %eax
- mov %eax, $fff0(%ebp)
- push some_arg
- call func
- mov %eax, %ecx
- mov $fff0(%ebp), %eax
- shl %cl, %eax
-
- Note that since shl could operate directly on memory, we could have:
-
- push some_arg
- call func
- mov %eax, %ecx
- shl %cl, $fffc(%ebp)
- The above example with loading the operand in a register is just to complicate
- the example and show that the algorithm should be able to handle it.
- Let's take another example with the this-call call convention (the first argument
- is passed in %ecx).
- In this case, will_clobber() will be called only on %eax and %edx, while %ecx
- will be allocated with get_register_force ().
- Note: when a register is allocated with get_register_force(), it should be set
- to a different state as soon as possible.
- store (local1, shl (local1, this-call (local1)))
- instruction register status -> [%eax %ecx %edx]
- start free free free
- eax = load local1 mov free free
- /* force load in %ecx */
- ecx = load local1 mov busy free
- spill eax free busy free
- call mov free free
- /* now eax contains the right operand of the shl */
- mov %eax -> %ecx free busy free
- un-spill mov busy free
- shl %cl, %eax mov free free
- What happens when a register that we need to allocate with get_register_force ()
- contains an operand for the next instruction?
- instruction register status -> [%eax %ecx %edx]
- eax = load local0 mov free free
- ecx = load local1 mov mov free
- get_register_force (ecx) here.
- We have two options:
- mov %ecx, %edx
- or:
- spill %ecx
- The first option is way better (and allows the peephole pass to
- just load the value in %edx directly, instead of loading first to %ecx).
- This doesn't work, though, if the instruction clobbers the %edx register
- (like in a this-call). So, we first need to clobber the registers
- (so the state of %ecx changes to freebale and there is no issue
- with get_register_force ()).
- What if an instruction both clobbers a register and requires it as
- an operand? Lets' take the x86 idiv instruction as an example: it
- requires the dividend in edx:eax and returns the result in eax,
- with the modulus in edx.
-
- store (local1, div (local1, local2))
-
- instruction register status -> [%eax %ecx %edx]
- eax = load local0 mov free free
- will_clobber eax, edx free mov free
- force mov %ecx, %eax busy free free
- set %edx busy free busy
- idiv mov free free
-
- Note: edx is set to free after idiv, because the modulus is not needed
- (if it was a rem, eax would have been freed).
- If we load the divisor before will_clobber(), we'll have to spill
- eax and reload it later. If we load it just after the idiv, there is no issue.
- In any case, the algorithm should give the correct results and allow the operation.
-
- Working recursively on the isntructions there shouldn't be huge issues
- with this algorithm (though, of course, it's not optimal and it may
- introduce excessive spills or register moves). The advantage over the current
- local reg allocator is that:
- 1) the number of spills/moves would be smaller anyway
- 2) a separate peephole pass could be able to eliminate reg moves
- 3) we'll be able to remove the 'forced' spills we currently do with
- the return value of method calls
- * Issues
- How to best integrate such a reg allocator with the burg stuff.
- Think about a call os sparc with two arguments: they got into %o0 and %o1
- and each of them sets the register as busy. But what if the values to put there
- are themselves the result of a call? %o0 is no problem, but for all the
- next argument n the above algorithm would spill all the 0...n-1 registers...
- * Papers
- More complex solutions to the local register allocator problem:
- http://dimacs.rutgers.edu/TechnicalReports/abstracts/1997/97-33.html
- Combining register allocation and instruction scheduling:
- http://citeseer.nj.nec.com/motwani95combining.html
- More on LRA euristics:
- http://citeseer.nj.nec.com/liberatore97hardness.html
- Linear-time optimal code scheduling for delayedload architectures
- http://www.cs.wisc.edu/~fischer/cs701.f01/inst.sched.ps.gz
- Precise Register Allocation for Irregular Architectures
- http://citeseer.nj.nec.com/kong98precise.html
- Allocate registers first to subtrees that need more of them.
- http://www.upb.de/cs/ag-kastens/compii/folien/comment401-409.2.pdf
|