| 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182 |
- Just some thoughts for the JITer:
- General issues:
- ===============
- We are designing a JIT compiler, so we have to consider two things:
- - the quality of the generated code
- - the time needed to generate that code
- The current approach is to keep the JITer as simple as possible, and thus as
- fast as possible. The generated code quality will suffer from that.
- Register allocation is first done inside the trees of the forest, and each
- tree can use the full set of registers. We simply split a tree if we get out of
- registers, for example the following tree:
- add(R0)
- / \
- / \
- a(R0) add(R1)
- / \
- / \
- b(R1) add(R2)
- / \
- / \
- c(R2) b(R3)
- can be transformed to:
- stloc(t1) add(R0)
- | / \
- | / \
- add(R0) a(R0) add(R1)
- / \ / \
- / \ / \
- c(R0) b(R1) b(R1) t1(R2)
- Please notice that the split trees use less registers than the original
- tree.
- Triggering JIT compilation:
- ===========================
- The current approach is to call functions indirectly. The address to call is
- stored in the MonoMethod structure. For each method we create a trampoline
- function. When called, this function does the JIT compilation and replaces the
- trampoline with the compiled method address.
- We should consider using the CACAO approach, they do not use a trampoline at
- all.
- Register Allocation:
- ====================
- With lcc you can assign a fixed register to a tree before register
- allocation. For example this is needed by call, which return the value always
- in EAX on x86. The current implementation works without such system, due to
- special forest generation.
- X86 Register Allocation:
- ========================
- We can use 8bit or 16bit registers on the x86. If we use that feature we have
- more registers to allocate, which maybe prevents some register spills. We
- currently ignore that ability and always allocate 32 bit registers, because I
- think we would gain very little from that optimisation and it would complicate
- the code.
- Different Register Sets:
- ========================
- Most processors have more that one register set, at least one for floating
- point values, and one for integers. Should we support architectures with more
- that two sets? Does someone knows such an architecture?
- 64bit Integer Values:
- =====================
- I can imagine two different implementation. On possibility would be to treat
- long (64bit) values simply like any other value type. This implies that we
- call class methods for ALU operations like add or sub. Sure, this method will
- be be a bit inefficient.
- The more performant solution is to allocate two 32bit registers for each 64bit
- value. We add a new non terminal to the monoburg grammar called long_reg. The
- register allocation routines takes care of this non terminal and allocates two
- 32 bit registers for them.
- Forest generation:
- ==================
- Consider the following code:
- OPCODE: STACK LOCALS
- LDLOC.0 (5) [5,0]
- LDC.1 (5,1) [5,0]
- STLOC.0 (5) [1,0]
- STLOC.1 () [1,5]
- The current forest generation generates:
- STLOC.0(LDC.1)
- STLOC.1(LDLOC.0)
- Which is wrong, since it stores the wrong value (1 instead of 5). Instead we
- must generate something like:
- STLOC.TMP(LDLOC.0)
- STLOC.0(LDC.1)
- STLOC.1(LDLOC.TMP)
- Where STLOC.TMP saves the value into a new temporary variable.
- We also need a similar solution for basic block boundaries when the stack depth
- is not zero. We can simply save those values to new temporary values. Consider
- the following basic block with one instruction:
- LDLOC.1
- This should generate a tree like:
- STLOC.TMP(LDLOC.1) Please notice that an intelligent register allocator can
- still allocate registers for those new variables.
- DAG handling:
- =============
- Monoburg can't handle DAGs, instead we need real trees as input for
- the code generator. So we have two problems:
- 1.) DUP instruction: This one is obvious - we need to store the value
- into a temporary variable to solve the problem.
- 2.) function calls: Chapter 12.8, page 343 of "A retargetable C compiler"
- explains that: "because listing a call node will give it a hidden reference
- from the code list". I don't understand that (can someone explain that?), but
- there is another reason to save return values to temporaries: Consider the
- following code:
- x = f(y) + g(z); // all functions return integers
- We could generate such a tree for this expression: STLOC(ADD(CALL,CALL))
- The problem is that both calls returns the value in the same register,
- so it is non trivial to generate code for that tree. We must copy one
- register into another one, which make register allocation more complex.
- The easier solution is store the result of function calls to
- temporaries. This leads to the following forest:
- STLOC(CALL)
- STLOC(CALL)
- STLOC(ADD (LDLOC, LDLOC))
- This is what lcc is doing, if I understood 12.8, page 342, 343?
- Value Types:
- ============
- The only CLI instructions which can handle value types are loads and stores,
- either to local variable, to the stack or to array elements. Value types with a
- size smaller than sizeof(int) are handled like any other basic type. For other
- value types we load the base address and emit block copies to store them.
- Possible Optimisations:
- =======================
- Miguel said ORP does some optimisation on IL level, for example moving array
- bounds checking out of loops:
- for (i = 0; i < N; i++) { check_range (a, i); a [i] = X; }
- id transformed to:
- if (in_range (a, 0, N)) { for (i = 0; i < N; i++) a[i] = X; }
- else for (i = 0; i < N; i++) { check_range (a, i); a [i] = X; }
- The "else" is only to keep original semantics (exception handling).
|