
Just some thoughts for the JITer:

General issues:
===============

We are designing a JIT compiler, so we have to consider two things:

- the quality of the generated code
- the time needed to generate that code

The current approach is to keep the JITer as simple as possible, and thus as
fast as possible. The generated code quality will suffer from that.
Register allocation is first done inside the trees of the forest, and each
tree can use the full set of registers. We simply split a tree if we run out
of registers. For example the following tree:

              add(R0)
             /       \
            /         \
         a(R0)      add(R1)
                   /       \
                  /         \
               b(R1)      add(R2)
                         /       \
                        /         \
                     c(R2)      b(R3)

can be transformed to:
     stloc(t1)             add(R0)
         |                /       \
         |               /         \
      add(R0)         a(R0)      add(R1)
     /       \                  /       \
    /         \                /         \
 c(R0)      b(R1)           b(R1)     t1(R2)

Please notice that the split trees use fewer registers than the original
tree.
Triggering JIT compilation:
===========================

The current approach is to call functions indirectly. The address to call is
stored in the MonoMethod structure. For each method we create a trampoline
function. When called, this function does the JIT compilation and replaces the
trampoline with the compiled method address.

We should consider using the CACAO approach; they do not use a trampoline at
all.
Register Allocation:
====================

With lcc you can assign a fixed register to a tree before register
allocation. For example this is needed by call, which always returns its
value in EAX on x86. The current implementation works without such a system,
due to special forest generation.
X86 Register Allocation:
========================

We can use 8-bit or 16-bit registers on the x86. If we use that feature we
have more registers to allocate, which may prevent some register spills. We
currently ignore that ability and always allocate 32-bit registers, because I
think we would gain very little from that optimisation and it would complicate
the code.
Different Register Sets:
========================

Most processors have more than one register set, at least one for floating
point values and one for integers. Should we support architectures with more
than two sets? Does anyone know of such an architecture?
64bit Integer Values:
=====================

I can imagine two different implementations. One possibility would be to
treat long (64-bit) values simply like any other value type. This implies
that we call class methods for ALU operations like add or sub. Sure, this
method will be a bit inefficient.

The more efficient solution is to allocate two 32-bit registers for each
64-bit value. We add a new non-terminal to the monoburg grammar called
long_reg. The register allocation routines take care of this non-terminal
and allocate two 32-bit registers for it.
Forest generation:
==================

Consider the following code:

    OPCODE:     STACK    LOCALS
    LDLOC.0     (5)      [5,0]
    LDC.1       (5,1)    [5,0]
    STLOC.0     (5)      [1,0]
    STLOC.1     ()       [1,5]

The current forest generation generates:

    STLOC.0(LDC.1)
    STLOC.1(LDLOC.0)

Which is wrong, since it stores the wrong value (1 instead of 5). Instead we
must generate something like:

    STLOC.TMP(LDLOC.0)
    STLOC.0(LDC.1)
    STLOC.1(LDLOC.TMP)

Where STLOC.TMP saves the value into a new temporary variable.
We also need a similar solution for basic block boundaries when the stack
depth is not zero. We can simply save those values to new temporary
variables. Consider the following basic block with one instruction:

    LDLOC.1

This should generate a tree like:

    STLOC.TMP(LDLOC.1)

Please notice that an intelligent register allocator can still allocate
registers for those new variables.
DAG handling:
=============

Monoburg can't handle DAGs; instead we need real trees as input for
the code generator. So we have two problems:

1.) DUP instruction: This one is obvious - we need to store the value
into a temporary variable to solve the problem.

2.) function calls: Chapter 12.8, page 343 of "A retargetable C compiler"
explains that: "because listing a call node will give it a hidden reference
from the code list". I don't understand that (can someone explain it?), but
there is another reason to save return values to temporaries. Consider the
following code:

    x = f(y) + g(z); // all functions return integers

We could generate such a tree for this expression: STLOC(ADD(CALL,CALL))

The problem is that both calls return their value in the same register,
so it is non-trivial to generate code for that tree. We would have to copy
one register into another, which makes register allocation more complex.
The easier solution is to store the result of function calls to
temporaries. This leads to the following forest:

    STLOC(CALL)
    STLOC(CALL)
    STLOC(ADD(LDLOC, LDLOC))

This is what lcc does, if I understood section 12.8 (pages 342-343)
correctly.
Value Types:
============

The only CLI instructions which can handle value types are loads and stores,
either to local variables, to the stack or to array elements. Value types
with a size smaller than sizeof(int) are handled like any other basic type.
For other value types we load the base address and emit block copies to store
them.
Possible Optimisations:
=======================

Miguel said ORP does some optimisation on the IL level, for example moving
array bounds checking out of loops:

    for (i = 0; i < N; i++) { check_range (a, i); a [i] = X; }

is transformed to:

    if (in_range (a, 0, N)) { for (i = 0; i < N; i++) a[i] = X; }
    else for (i = 0; i < N; i++) { check_range (a, i); a [i] = X; }

The "else" is only to keep original semantics (exception handling).