Mono JIT porting guide.
Paolo Molaro ([email protected])

* Introduction

This document describes the process of porting the mono JIT
to a new CPU architecture. The new mono JIT has been designed
to make porting easier while still allowing a port to take
full advantage of the new architecture's features and
instructions. Knowledge of the mini architecture (described in
the mini-doc.txt file) is a requirement for understanding this
guide, as is an earlier document about porting the mono
interpreter (available on the web site).

There are six main areas that a port needs to implement to
have a fully-functional JIT for a given architecture:

	1) instruction selection
	2) native code emission
	3) call conventions and register allocation
	4) method trampolines
	5) exception handling
	6) minor helper methods

To take advantage of some not-so-common processor features
(for example conditional execution of instructions, as may be
found on ARM or ia64), it may be necessary to develop a
high-level optimization, but doing so is not a requirement for
getting the JIT to work.

We'll look at each of the required steps in more detail. Note,
though, that a new port may just as well start from a
cut&paste of an existing port to a similar architecture (for
example from x86 to amd64, or from powerpc to sparc).

The architecture-specific code is split from the rest of the
JIT. For example, the x86-specific code and data are all
included in the following files in the distribution:

	mini-x86.h mini-x86.c
	inssel-x86.brg
	cpu-pentium.md
	tramp-x86.c
	exceptions-x86.c

I suggest a similar split for other architectures as well.

Note that this document is still incomplete: some sections are
only sketched and some are missing, but the important info to
get a port going is already described.

* Architecture-specific instructions and instruction selection

The JIT already provides a set of instructions that can be
easily mapped to a great variety of different processor
instructions. Sometimes it may be necessary or advisable to
add a new instruction that represents more closely an
instruction in the architecture. A mini instruction can also
be used to represent a short sequence of low-level CPU
instructions, but note that each mini instruction is the
minimum amount of code the instruction scheduler will handle
(i.e., the scheduler won't schedule the instructions that
compose the low-level sequence individually, but only the
whole sequence, as an indivisible block).

New instructions are created by adding a line in the
mini-ops.h file, assigning an opcode and a name. The inputs
and outputs of the instruction are specified in one of two
places, depending on the context in which the instruction gets
used.

If the instruction is used in the tree representation, the
input and output types are defined by the BURG rules in the
*.brg files (the usual non-terminals are 'reg' to represent a
normal register, 'lreg' to represent a register or two that
hold a 64 bit value, and 'freg' for a floating point register).

If an instruction is used as a low-level CPU instruction, the
info is specified in a machine description file. The
description file is processed by the genmdesc program to
provide a data structure that can be easily used from C code
to query the needed info about the instruction.

As an example, let's consider the add instruction for both x86
and ppc:

x86 version:
	add: dest:i src1:i src2:i len:2 clob:1
ppc version:
	add: dest:i src1:i src2:i len:4

Note that the instruction takes two input integer registers on
both CPUs, but on x86 the first source register is clobbered
(clob:1) and the length in bytes of the instruction differs.

Note that integer adds and floating point adds use different
opcodes, unlike the IL language (a 64 bit add is done with two
instructions on 32 bit architectures, using an add that sets
the carry and an add with carry).

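The two-instruction 64 bit add can be illustrated in plain C. This
is only a sketch of the arithmetic, not JIT code: the pair64 type
and the function names are made up for the example, and each
commented line stands in for one machine instruction.

```c
#include <stdint.h>

/* Hypothetical illustration, not Mono code: a 64 bit value held as
 * two 32 bit halves, the way a register pair holds it on a 32 bit
 * target. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} pair64;

/* Add two 64 bit values using only 32 bit operations. */
static pair64 add64_with_carry(pair64 a, pair64 b)
{
    pair64 r;
    uint32_t carry;

    r.lo = a.lo + b.lo;          /* first instruction: add, setting the carry */
    carry = (r.lo < a.lo);       /* unsigned wrap-around means a carry out */
    r.hi = a.hi + b.hi + carry;  /* second instruction: add with carry */
    return r;
}

/* Reassemble the halves so the result can be compared with a native add. */
static uint64_t pair64_to_u64(pair64 p)
{
    return ((uint64_t)p.hi << 32) | p.lo;
}
```

Adding 1 to 0x1FFFFFFFF this way carries out of the low half and
leaves 2 in the high half.
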
A specific CPU port may assign any meaning to the clob field
for an instruction, since the value will be processed in an
arch-specific file anyway.

See the top of the existing cpu-pentium.md file for more info
on other fields: the info may or may not be applicable to a
different CPU; in the latter case the info can be ignored.

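To make the format of a description line concrete, here is a
minimal sketch of a parser for lines like the add examples above.
It is not how genmdesc works and the insn_desc layout is an
assumption made for the example; it only handles the five keys
that appear in this guide.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record for one line of a machine description file.
 * The field names mirror the keys in the file; this is not the
 * data structure that genmdesc produces. */
typedef struct {
    char name[32];
    char dest, src1, src2;  /* register class, or 0 if the key is absent */
    char clob;              /* clobber specifier, or 0 if absent */
    int  len;               /* length in bytes */
} insn_desc;

/* Parse a line such as "add: dest:i src1:i src2:i len:2 clob:1".
 * Returns 0 on success, -1 if the line has no "name:" prefix. */
static int parse_insn_desc(const char *line, insn_desc *out)
{
    char buf[128];
    char *tok;
    char *colon;

    assert(line != NULL && out != NULL);
    memset(out, 0, sizeof *out);
    strncpy(buf, line, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';

    tok = strtok(buf, " \t\n");
    if (tok == NULL)
        return -1;
    colon = strchr(tok, ':');
    if (colon == NULL || colon[1] != '\0')
        return -1;
    *colon = '\0';
    strncpy(out->name, tok, sizeof out->name - 1);

    while ((tok = strtok(NULL, " \t\n")) != NULL) {
        colon = strchr(tok, ':');
        if (colon == NULL)
            continue;           /* skip anything that is not key:value */
        *colon++ = '\0';
        if (strcmp(tok, "dest") == 0)
            out->dest = colon[0];
        else if (strcmp(tok, "src1") == 0)
            out->src1 = colon[0];
        else if (strcmp(tok, "src2") == 0)
            out->src2 = colon[0];
        else if (strcmp(tok, "clob") == 0)
            out->clob = colon[0];
        else if (strcmp(tok, "len") == 0)
            out->len = atoi(colon);
    }
    return 0;
}
```

Unknown keys are silently ignored here, which matches the advice
that fields not applicable to a given CPU can be ignored.
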
The code in mini.c together with the BURG rules in inssel.brg,
inssel-float.brg and inssel-long32.brg provides general
purpose mappings from the tree representation to a set of
instructions that should be easy to implement on any
architecture. To allow for additional arch-specific
functionality, an arch-specific BURG file can be used: in this
file arch-specific instructions can be selected that provide
better performance than the general instructions or that
provide functionality that is needed by the JIT but that
cannot be expressed in a general enough way.

As an example, x86 has the special instruction "push" to make
it easier to implement the default call convention (passing
arguments on the stack): almost all the other architectures
don't have such an instruction (and don't need it anyway), so
we added a special rule in the inssel-x86.brg file for it.

So, one of the first things needed in a port is to write a
cpu-$(arch).md machine description file and fill it with the
needed info. To start with, only a few instructions need to be
specified, like the ones required to do simple integer
operations. The default rules of the instruction selector will
emit the common instructions and so we're ready to go for the
next step in porting the JIT.

* Native code emission

Since the first step in porting mono to a new CPU is to port
the interpreter, there should already be a file that allows
the emission of binary native code into a buffer for the
architecture. This file should be placed in the

	mono/arch/$(arch)/

directory.

The bulk of the code emission happens in the mini-$(arch).c
file, in a function called mono_arch_output_basic_block ().
This function takes a basic block, walks the list of
instructions in the block and emits the binary code for each.
Optionally a peephole optimization pass is done on the basic
block, but this can be left for later, when the port actually
works.

This function is very simple: there is just a big switch on
the instruction opcode, and in the corresponding case the
functions or macros to emit the binary native code are
used. Note that in this function the lengths of the
instructions are used to determine if the buffer for the code
needs enlarging.

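The shape of such a function can be sketched as follows. Everything
here is hypothetical: the OPX_* opcodes, the insn and code_buf
types and the byte encodings are invented for the example and do
not correspond to the real JIT data structures or to any real CPU.
The point is only the switch on the opcode and the use of a
per-opcode maximum length to grow the buffer before emitting.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical opcodes and instruction node (not the JIT's own types). */
enum { OPX_NOP, OPX_ADD, OPX_RET };

typedef struct insn {
    int opcode;
    int dreg, sreg;
    struct insn *next;
} insn;

typedef struct {
    uint8_t *code;
    size_t len, size;
} code_buf;

/* Worst-case encoded length per opcode: the role played by the
 * len: field of the machine description. */
static const size_t max_len[] = { 1, 2, 1 };

/* Grow the buffer if the next instruction might not fit. */
static int ensure_space(code_buf *b, size_t need)
{
    if (b->len + need > b->size) {
        size_t nsize = (b->size ? b->size * 2 : 16) + need;
        uint8_t *p = realloc(b->code, nsize);
        if (p == NULL)
            return -1;
        b->code = p;
        b->size = nsize;
    }
    return 0;
}

/* Walk the instruction list of one basic block and emit each opcode. */
static int output_basic_block(code_buf *b, const insn *first)
{
    const insn *ins;

    for (ins = first; ins != NULL; ins = ins->next) {
        assert(ins->opcode >= OPX_NOP && ins->opcode <= OPX_RET);
        if (ensure_space(b, max_len[ins->opcode]) != 0)
            return -1;
        switch (ins->opcode) {
        case OPX_NOP:
            b->code[b->len++] = 0x00;
            break;
        case OPX_ADD:
            b->code[b->len++] = 0x01;
            b->code[b->len++] = (uint8_t)((ins->dreg << 4) | ins->sreg);
            break;
        case OPX_RET:
            b->code[b->len++] = 0x02;
            break;
        }
    }
    return 0;
}
```

Checking the space once per instruction, using the declared maximum
length, keeps the individual cases free of bounds checks.
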
To complete the code emission for a method, a few other
functions need implementing as well:

	mono_arch_emit_prolog ()
	mono_arch_emit_epilog ()
	mono_arch_patch_code ()

mono_arch_emit_prolog () will emit the code to set up the stack
frame for a method, optionally call the callbacks used in
profiling and tracing, and move the arguments to their home
location (in a callee-saved register if the variable was
allocated to one, or in a stack location if the argument was
passed in a volatile register and wasn't allocated a
non-volatile one). Callee-saved registers used by the function
are saved in the prolog as well.

mono_arch_emit_epilog () will emit the code needed to return
from the function, optionally calling the profiling or tracing
callbacks. At this point the basic blocks or the code that was
moved out of the normal flow for the function can be emitted
as well (this is usually done to provide better info for the
static branch predictor). In the epilog, callee-saved
registers are restored if they were used.

Note that, to help exception handling and stack unwinding,
when there is a transition from managed to unmanaged code,
some special processing needs to be done (basically, saving
all the registers and setting up the links in the Last Managed
Frame structure).

When the epilog has been emitted, the upper level code
arranges for the buffer of memory that contains the native
code to be copied to an area of executable memory. At this
point, instructions that use relative addressing need to be
patched to have the right offsets: this work is done by
mono_arch_patch_code ().

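The patching step can be sketched like this. The patch_info record
and the function names are assumptions made for the example (the
real JIT keeps richer patch information), and the displacement
convention shown, a 32 bit offset relative to the end of the
displacement field, is the one used by x86 relative calls and
jumps; other architectures encode branches differently.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical patch record: 'offset' is where a 32 bit displacement
 * lives inside the method's code, 'target' is the absolute address
 * it must reach. */
typedef struct {
    size_t    offset;
    uintptr_t target;
} patch_info;

/* Once the final address 'base' of the code is known, rewrite each
 * displacement relative to the end of its 4-byte field. The target
 * is assumed to be within 32 bit signed range of the call site. */
static void patch_code(uint8_t *code, uintptr_t base,
                       const patch_info *patches, size_t n)
{
    size_t i;

    assert(code != NULL);
    for (i = 0; i < n; i++) {
        uintptr_t next = base + patches[i].offset + 4;
        int32_t disp = (int32_t)(uint32_t)(patches[i].target - next);
        memcpy(code + patches[i].offset, &disp, sizeof disp);
    }
}

/* Read a displacement back, for checking. */
static int32_t read_disp(const uint8_t *code, size_t offset)
{
    int32_t disp;

    memcpy(&disp, code + offset, sizeof disp);
    return disp;
}
```

The displacements cannot be computed earlier because they depend on
the address the code ends up at, which is only known after the copy.
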
* Call conventions and register allocation

To account for the differences in the call conventions, a few
functions need to be implemented.

mono_arch_allocate_vars () assigns to both arguments and local
variables the offset relative to the frame register where they
are stored; dead variables are simply discarded. The total
amount of stack needed is calculated.

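A minimal sketch of this kind of offset assignment is given below.
The var_info type is hypothetical, a downward-growing stack is
assumed, and the frame register is assumed to be aligned at least
as strictly as any variable; a real port also has to place incoming
arguments according to the platform ABI.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variable record, not the JIT's own representation. */
typedef struct {
    int size;     /* size in bytes */
    int align;    /* required alignment, a power of two */
    int is_dead;  /* nonzero if the variable is never used */
    int offset;   /* out: offset from the frame register, 0 if no slot */
} var_info;

/* Assign each live variable a slot below the frame register and
 * return the total frame size, rounded up to frame_align. */
static int allocate_vars(var_info *vars, size_t n, int frame_align)
{
    int used = 0;
    size_t i;

    assert(frame_align > 0 && (frame_align & (frame_align - 1)) == 0);
    for (i = 0; i < n; i++) {
        if (vars[i].is_dead) {
            vars[i].offset = 0;      /* dead variables get no storage */
            continue;
        }
        used += vars[i].size;
        used = (used + vars[i].align - 1) & ~(vars[i].align - 1);
        vars[i].offset = -used;      /* address = frame register - used */
    }
    return (used + frame_align - 1) & ~(frame_align - 1);
}
```

Live slots always get a negative offset in this scheme, so 0 can
safely mean "no slot".
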
mono_arch_call_opcode () is the function that most closely
deals with the call convention on a given system. For each
argument to a function call, an instruction is created that
actually puts the argument where needed, be it the stack or a
specific register. This function can also re-arrange the order
of evaluation when multiple arguments are involved, if needed
(for example, on x86 arguments are pushed on the stack in
reverse order). The function needs to carefully take into
account platform-specific issues, like how structures are
returned as well as the differences in size and/or alignment
of managed and corresponding unmanaged structures.

The other chunk of code that needs to deal with the call
convention and other specifics of a CPU is the local register
allocator, implemented in a function named
mono_arch_local_regalloc (). The local allocator deals with one
basic block at a time and basically just allocates registers
for temporary values during expression evaluation, spilling
and unspilling as necessary.

The local allocator needs to take into account clobbering
information, both during simple instructions and during
function calls, and it needs to deal with other
architecture-specific weirdnesses, like instructions that take
inputs only in specific registers or output only in some.

Some effort will be put later in moving most of the local
register allocator to a common file so that the code can be
shared more for similar, risc-like CPUs. The register
allocator does a first pass on the instructions in a block,
collecting liveness information, and in a backward pass on the
same list performs the actual register allocation, inserting
the instructions needed to spill values, if necessary.

The cross-platform local register allocator is now implemented
and it is documented in the jit-regalloc file.

When this part of the code is implemented, some testing can be
done with the generated code for the new architecture. Most
helpful is the use of the --regression command line switch to
run the regression tests (basic.cs, for example).

Note that the JIT will try to initialize the runtime, but it
may not yet be able to compile and execute complex code:
commenting out most of the code in the mini_init() function in
mini.c is needed to let the JIT just compile the regression
tests. Also, using multiple -v switches on the command line
makes the JIT dump an increasing amount of information during
compilation.

Values loaded into registers need to be extended as required
by the ECMA specs:

	*) integers smaller than 4 bytes are extended to int32 values
	*) 32 bit floats are extended to double precision (in particular
	   this means that currently all the floating point operations
	   operate on doubles)

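In C terms the widening rules look like the following sketch. The
load_* names are invented for the example; on a real port each of
these corresponds to choosing the right sign-extending or
zero-extending load instruction (or an explicit extension after
the load). Signed types are sign-extended and unsigned types are
zero-extended.

```c
#include <stdint.h>

/* Signed small integers: the sign bit is replicated into the upper bits. */
static int32_t load_i1(const int8_t *p)   { return (int32_t)*p; }
static int32_t load_i2(const int16_t *p)  { return (int32_t)*p; }

/* Unsigned small integers: the upper bits are filled with zeroes. */
static int32_t load_u1(const uint8_t *p)  { return (int32_t)*p; }
static int32_t load_u2(const uint16_t *p) { return (int32_t)*p; }

/* 32 bit floats are widened to double precision. */
static double load_r4(const float *p)     { return (double)*p; }
```
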
* Method trampolines

To get better startup performance, the JIT actually compiles a
method only when needed. To achieve this, when a call to a
method is compiled, we actually emit a call to a magic
trampoline. The magic trampoline is a function written in
assembly that invokes the compiler to compile the given method
and jumps to the newly compiled code, ensuring the arguments
it received are passed correctly to the actual method.

Before jumping to the new code, though, the magic trampoline
takes care of patching the call site so that next time the
call will go directly to the method instead of the
trampoline. How does this all work?

mono_arch_create_jit_trampoline () creates a small function
that just preserves the arguments passed to it and adds an
additional argument (the method to compile) before calling the
generic trampoline. This small function is called the specific
trampoline, because it is method-specific (the method to
compile is hard-coded in the instruction stream).

The generic trampoline saves all the arguments that could get
clobbered and calls a C function that will do two things:

	*) actually call the JIT to compile the method
	*) identify the calling code so that it can be patched to
	   call directly the actual method

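The control flow can be simulated in portable C, with a function
pointer slot standing in for the call instruction that gets
patched. This is only a model of the idea: real trampolines are
generated machine code, the patching rewrites the call site in the
instruction stream, and all the names below (method, call_site,
compile_method, invoke) are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*native_fn)(int);

/* Hypothetical method record. */
typedef struct {
    native_fn source;   /* stand-in for the method's IL */
    native_fn code;     /* native code, NULL until compiled */
    int compile_count;  /* how many times the "JIT" ran */
} method;

/* Hypothetical call site. 'callee' plays the role of the method
 * hard-coded in the specific trampoline; 'target' plays the role
 * of the call instruction's destination. */
typedef struct {
    method *callee;
    native_fn target;   /* NULL means "still calls the trampoline" */
} call_site;

/* Stand-in for the C function called by the generic trampoline. */
static native_fn compile_method(method *m)
{
    assert(m != NULL && m->source != NULL);
    if (m->code == NULL) {
        m->code = m->source;  /* pretend to generate native code */
        m->compile_count++;
    }
    return m->code;
}

/* Execute a call through a call site. */
static int invoke(call_site *site, int arg)
{
    if (site->target == NULL) {
        /* Trampoline path: compile the method and patch the call
         * site, then continue to the new code with the original
         * argument. */
        site->target = compile_method(site->callee);
    }
    return site->target(arg);
}

/* A sample "method" used to exercise the model. */
static int twice(int x)
{
    return 2 * x;
}
```

After the first call through a site, later calls reach the compiled
code directly and the compiler is not entered again.
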
If the 'this' argument to a method is a boxed valuetype that
is passed to a method that expects just a pointer to the data,
an additional unboxing trampoline will need to be inserted as
well.

* Exception handling

Exception handling is likely the most difficult part of the
port, as it needs to deal with unwinding (both managed and
unmanaged code) and calling catch and filter blocks. It also
needs to deal with signals, because mono takes advantage of
the MMU in the CPU and of the operating system to handle
dereferences of the NULL pointer. Some of the functions needed
to implement the mechanisms are:

mono_arch_get_throw_exception () returns a function that takes
an exception object and invokes an arch-specific function that
will enter the exception processing. To do so, all the
relevant registers need to be saved and passed on.

mono_arch_handle_exception () takes the exception thrown and a
context that describes the state of the CPU at the time the
exception was thrown. The function needs to implement the
exception handling mechanism, so it searches for a handler for
the exception and, if none is found, it follows the unhandled
exception path (that can print a trace and exit or just abort
the current thread). The difficulty here is to unwind the
stack correctly, by restoring the register state at each call
site in the call chain, calling finally, filter and handler
blocks while doing so.

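The search-then-unwind logic can be modelled without any real
stack walking. In this sketch the call chain is just an array of
hypothetical frame records, each with at most one clause, and
"running" a finally block only bumps a counter. Doing the search
as a separate first pass is an assumption of the example, and
filters, register restoring and the actual transfer of control are
left out entirely.

```c
#include <assert.h>
#include <stddef.h>

enum clause_kind { CLAUSE_NONE, CLAUSE_FINALLY, CLAUSE_CATCH };

/* Hypothetical frame record with at most one exception clause. */
typedef struct {
    enum clause_kind kind;
    int catch_type;     /* only meaningful for CLAUSE_CATCH */
    int finally_runs;   /* counts how often the finally block ran */
} frame;

/* frames[0] is the innermost frame. Returns the index of the frame
 * whose catch clause handles exc_type, or -1 if the exception is
 * unhandled. Finally blocks of the frames that get unwound are run. */
static int handle_exception(frame *frames, size_t n, int exc_type)
{
    size_t i;
    size_t handler = n;

    assert(frames != NULL || n == 0);

    /* First pass: look for a handler without changing any state. */
    for (i = 0; i < n; i++) {
        if (frames[i].kind == CLAUSE_CATCH &&
            frames[i].catch_type == exc_type) {
            handler = i;
            break;
        }
    }
    if (handler == n)
        return -1;      /* unhandled exception path */

    /* Second pass: unwind up to the handler, running finally blocks. */
    for (i = 0; i < handler; i++) {
        if (frames[i].kind == CLAUSE_FINALLY)
            frames[i].finally_runs++;
    }
    return (int)handler;
}
```
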
As part of exception handling a couple of internal calls need
to be implemented as well.

ves_icall_get_frame_info () returns info about a specific
frame.

mono_jit_walk_stack () walks the stack and calls a callback
with info for each frame found.

ves_icall_get_trace () returns an array of StackFrame objects.

** Code generation for filter/finally handlers

Filter and finally handlers are called from 2 different locations:

	1.) from within the method containing the exception clauses
	2.) from the stack unwinding code

To make this possible we implement them like subroutines,
ending with a "return" statement. The subroutine does not save
the base pointer, because we need access to the local
variables of the enclosing method. It is possible that
instructions inside those handlers modify the stack pointer,
thus we save the stack pointer at the start of the handler,
and restore it at the end. We have to use a "call" instruction
to execute such finally handlers.

The MIR code for filter and finally handlers looks like:

	OP_START_HANDLER
	...
	OP_END_FINALLY | OP_ENDFILTER(reg)

OP_START_HANDLER: should save the stack pointer somewhere.
OP_END_FINALLY: restores the stack pointer and returns.
OP_ENDFILTER (reg): restores the stack pointer and returns the
value in "reg".

** Calling finally/filter handlers

There is a special opcode to call those handlers, called
OP_CALL_HANDLER. It simply emits a call instruction.

It's a bit more complex to call a handler from outside (in the
stack unwinding code), because we have to restore the whole
context of the method first. After that we simply emit a call
instruction to invoke the handler. It's usually possible to use
the same code to call filter and finally handlers (see
arch_get_call_filter).

** Calling catch handlers

Catch handlers are always called from the stack unwinding
code. Unlike finally clauses or filters, catch handlers never
return. Instead we simply restore the whole context, and
restart execution at the catch handler.

** Passing Exception objects to catch handlers and filters

We use a local variable to store exception objects. The stack
unwinding code must store the exception object into this
variable before calling a catch handler or filter.

* Minor helper methods

A few minor helper methods are referenced from the
arch-independent code. Some of them are:

*) mono_arch_cpu_optimizations ()
	This function returns a mask of optimizations that
	should be enabled for the current CPU and a mask of
	optimizations that should be excluded, instead.

*) mono_arch_regname ()
	Returns the name for a numeric register.

*) mono_arch_get_allocatable_int_vars ()
	Returns a list of variables that can be allocated to
	the integer registers in the current architecture.

*) mono_arch_get_global_int_regs ()
	Returns a list of callee-saved registers that can be
	used to allocate variables in the current method.

*) mono_arch_instrument_mem_needs ()
*) mono_arch_instrument_prolog ()
*) mono_arch_instrument_epilog ()
	Functions needed to implement the profiling interface.

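The first two helpers are simple enough to sketch for an imaginary
CPU. The register names, the OPT_* flags, the has_cmov parameter
and the function signatures are all assumptions made for this
example; a real port uses the flags and signatures declared by the
arch-independent code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register names for an imaginary 8-register CPU. */
static const char *arch_regname(int reg)
{
    static const char *const names[] = {
        "r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7"
    };

    if (reg < 0 || reg >= 8)
        return "unknown";
    return names[reg];
}

/* Hypothetical optimization flags. */
enum {
    OPT_PEEPHOLE = 1 << 0,
    OPT_CMOV     = 1 << 1
};

/* Return the mask of optimizations to enable and store the mask of
 * optimizations to exclude in *exclude_mask. 'has_cmov' stands in
 * for whatever CPU feature detection the port performs. */
static uint32_t arch_cpu_optimizations(int has_cmov, uint32_t *exclude_mask)
{
    uint32_t opts = OPT_PEEPHOLE;

    assert(exclude_mask != NULL);
    *exclude_mask = 0;
    if (has_cmov)
        opts |= OPT_CMOV;
    else
        *exclude_mask |= OPT_CMOV;
    return opts;
}
```
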
* Testing the port

The JIT has a set of regression tests in *.cs files inside the
mini directory.

The usual method of testing a port is by compiling these tests
on another machine with a working runtime by typing
'make rcheck', then copying TestDriver.dll and *.exe to the
mini directory. The tests can be run by typing:

	./mono --regression <exe file name>

The suggested order for working through these tests is the
following:

	- basic.exe
	- basic-long.exe
	- basic-float.exe
	- basic-calls.exe
	- objects.exe
	- arrays.exe
	- exceptions.exe
	- iltests.exe
	- generics.exe

* Writing regression tests

A regression test should be written for any bug found in the
JIT and added to one of the *.cs files in the mini
directory. Eventually all the operations of the JIT should be
tested (including the ones that get selected only when some
specific optimization is enabled).

* Platform specific optimizations

An example of a platform-specific optimization is the peephole
optimization: we look at a small window of code at a time and
we replace one or more instructions with others that perform
better for the given architecture or CPU.

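A peephole pass over a tiny, made-up instruction set might look
like this sketch. The P_* opcodes, the pinsn type and the three
rewrites are hypothetical and chosen only to show the window
technique; which rewrites are actually profitable depends on the
target CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes: dreg = sreg, dreg *= imm, dreg <<= imm, nothing. */
enum { P_MOVE, P_MUL_IMM, P_SHL_IMM, P_NOP };

typedef struct {
    int op;
    int dreg;
    int sreg;
    int imm;
} pinsn;

/* Look at a window of up to two instructions at a time and rewrite
 * them in place. Returns the number of instructions changed. */
static int peephole_pass(pinsn *code, size_t n)
{
    int changed = 0;
    size_t i;

    assert(code != NULL || n == 0);
    for (i = 0; i < n; i++) {
        switch (code[i].op) {
        case P_MOVE:
            if (code[i].dreg == code[i].sreg) {
                /* move r, r does nothing */
                code[i].op = P_NOP;
                changed++;
            } else if (i + 1 < n && code[i + 1].op == P_MOVE &&
                       code[i + 1].dreg == code[i].sreg &&
                       code[i + 1].sreg == code[i].dreg) {
                /* move a, b; move b, a: the second move is redundant */
                code[i + 1].op = P_NOP;
                changed++;
            }
            break;
        case P_MUL_IMM:
            if (code[i].imm == 2) {
                /* multiply by 2 becomes a shift left by 1 */
                code[i].op = P_SHL_IMM;
                code[i].imm = 1;
                changed++;
            }
            break;
        default:
            break;
        }
    }
    return changed;
}
```
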
* 64 bit support tips, by Zoltan Varga ([email protected])

For a 64-bit port of the Mono runtime, you will typically need
to do the following:

	* use inssel-long.brg instead of inssel-long32.brg.

	* implement lots of new opcodes:

		OP_I<OP> is a 32 bit op
		OP_L<OP> and CEE_<OP> are 64 bit ops

The 64 bit version of an existing port might share the code
with the 32 bit port (for example SPARC/SPARCV9), or it might
be separate (x86/AMD64).

That will depend on the similarities of the two instruction
sets, ABIs, etc.

The runtime and most parts of the JIT are 64 bit clean
at this point, so the only parts which require changing are
the arch-dependent files.
  369. The runtime and most parts of the JIT are 64 bit clean
  370. at this point, so the only parts which require changing are
  371. the arch dependent files.