|
- Mono JIT porting guide.
- Paolo Molaro ([email protected])
- * Introduction
- This document describes the process of porting the mono JIT
- to a new CPU architecture. The new mono JIT has been designed
- to make porting easier while still enabling the port to take
- full advantage of the new architecture's features and
- instructions. Knowledge of the mini architecture (described in
- the mini-doc.txt file) is a requirement for understanding this
- guide, as is an earlier document about porting the mono
- interpreter (available on the web site).
-
- There are six main areas that a port needs to implement to
- have a fully-functional JIT for a given architecture:
-
- 1) instruction selection
- 2) native code emission
- 3) call conventions and register allocation
- 4) method trampolines
- 5) exception handling
- 6) minor helper methods
-
- To take advantage of some not-so-common processor features
- (for example conditional execution of instructions as may be
- found on ARM or ia64), it may be necessary to develop a
- high-level optimization, but doing so is not a requirement for
- getting the JIT to work.
-
- We'll look at each of the required steps in more detail. Note,
- though, that a new port may just as well start from a
- cut&paste of an existing port to a similar architecture (for
- example from x86 to amd64, or from powerpc to sparc).
-
- The architecture specific code is split from the rest of the
- JIT, for example the x86 specific code and data is all
- included in the following files in the distribution:
-
- mini-x86.h mini-x86.c
- inssel-x86.brg
- cpu-pentium.md
- tramp-x86.c
- exceptions-x86.c
-
- I suggest a similar split for other architectures as well.
-
- Note that this document is still incomplete: some sections are
- only sketched and some are missing, but the important info to
- get a port going is already described.
- * Architecture-specific instructions and instruction selection.
- The JIT already provides a set of instructions that can be
- easily mapped to a great variety of different processor
- instructions. Sometimes it may be necessary or advisable to
- add a new instruction that represents more closely an
- instruction in the architecture. A mini instruction can also
- be used to represent a short sequence of low-level CPU
- instructions, but each mini instruction is the smallest unit
- of code the instruction scheduler will handle (i.e., the
- scheduler won't schedule the instructions that compose the
- low-level sequence individually, but only the whole sequence,
- as an indivisible block).
- New instructions are created by adding a line in the
- mini-ops.h file, assigning an opcode and a name. To specify
- the input and output for the instruction, there are two
- different places, depending on the context in which the
- instruction gets used.
- If the instruction is used in the tree representation, the
- input and output types are defined by the BURG rules in the
- *.brg files (the usual non-terminals are 'reg' to represent a
- normal register, 'lreg' to represent a register or two that
- hold a 64 bit value, and 'freg' for a floating point register).
- If an instruction is used as a low-level CPU instruction, the
- info is specified in a machine description file. The
- description file is processed by the genmdesc program to
- provide a data structure that can be easily used from C code
- to query the needed info about the instruction.
- As an example, let's consider the add instruction for both x86
- and ppc:
-
- x86 version:
- add: dest:i src1:i src2:i len:2 clob:1
- ppc version:
- add: dest:i src1:i src2:i len:4
-
- Note that the instruction takes two input integer registers on
- both CPUs, but on x86 the first source register is clobbered
- (clob:1) and the length in bytes of the instruction differs.
- Note that integer adds and floating point adds use different
- opcodes, unlike the IL language (a 64 bit add is done with two
- instructions on 32 bit architectures, using an add that sets
- the carry and an add with carry).
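- As a minimal sketch of the two-instruction 64 bit add mentioned
- above, the following C code adds two values held as pairs of
- 32 bit halves. The Pair64 type and the function names are
- hypothetical and not part of the JIT, and the carry is computed
- explicitly here, where real hardware would keep it in a flag.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: a 64 bit value held in two 32 bit registers. */
typedef struct {
	uint32_t lo;
	uint32_t hi;
} Pair64;

/* Add two 64 bit values using only 32 bit operations. */
Pair64
add64_with_carry (Pair64 a, Pair64 b)
{
	Pair64 r;
	uint32_t carry;

	r.lo = a.lo + b.lo;          /* first add: produces the carry */
	carry = (r.lo < a.lo);       /* unsigned wrap-around means carry out */
	r.hi = a.hi + b.hi + carry;  /* second add: add with carry */
	return r;
}

/* 0x00000001FFFFFFFF + 0x0000000200000001 == 0x0000000400000000 */
void
add64_example (void)
{
	Pair64 a = { 0xFFFFFFFFu, 0x1u };
	Pair64 b = { 0x00000001u, 0x2u };
	Pair64 r = add64_with_carry (a, b);

	assert (r.lo == 0u && r.hi == 0x4u);
}
```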
- A specific CPU port may assign any meaning to the clob field
- for an instruction since the value will be processed in an
- arch-specific file anyway.
- See the top of the existing cpu-pentium.md file for more info
- on other fields: the info may or may not be applicable to a
- different CPU; if it is not, it can be ignored.
- The code in mini.c together with the BURG rules in inssel.brg,
- inssel-float.brg and inssel-long32.brg provides general
- purpose mappings from the tree representation to a set of
- instructions that should be easily implemented in any
- architecture. To allow for additional arch-specific
- functionality, an arch-specific BURG file can be used: in this
- file arch-specific instructions can be selected that provide
- better performance than the general instructions or that
- provide functionality that is needed by the JIT but that
- cannot be expressed in a general enough way.
-
- As an example, x86 has the special instruction "push" to make
- it easier to implement the default call convention (passing
- arguments on the stack): almost all the other architectures
- don't have such an instruction (and don't need it anyway), so
- we added a special rule in the inssel-x86.brg file for it.
-
- So, one of the first things needed in a port is to write a
- cpu-$(arch).md machine description file and fill it with the
- needed info. As a start, only a few instructions can be
- specified, like the ones required to do simple integer
- operations. The default rules of the instruction selector will
- emit the common instructions and so we're ready to go for the
- next step in porting the JIT.
-
- * Native code emission
- Since the first step in porting mono to a new CPU is to port
- the interpreter, there should already be a file that allows
- the emission of binary native code in a buffer for the
- architecture. This file should be placed in the
- mono/arch/$(arch)/
- directory.
- The bulk of the code emission happens in the mini-$(arch).c
- file, in a function called mono_arch_output_basic_block ().
- This function takes a basic block, walks the list of
- instructions in the block and emits the binary code for each.
- Optionally a peephole optimization pass is done on the basic
- block, but this can be left for later, when the port actually
- works.
- This function is very simple: there is just a big switch on
- the instruction opcode, and each case uses the functions or
- macros that emit the binary native code. Note that in this
- function the lengths of the instructions are used to
- determine if the buffer for the code needs enlarging.
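- The shape of such an emitter can be sketched as follows. This is
- not the real mono_arch_output_basic_block (): the Toy* types, the
- three opcodes and the toy_max_len table (standing in for the len
- field of a machine description) are all hypothetical. Only the
- emitted bytes are real x86 encodings.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical opcodes and instruction node. */
enum { TOY_OP_NOP, TOY_OP_RET, TOY_OP_MOVE, TOY_OP_COUNT };

typedef struct ToyIns {
	int opcode;
	int dreg, sreg;          /* x86 register numbers, 0..7 */
	struct ToyIns *next;
} ToyIns;

typedef struct {
	uint8_t *code;
	size_t len;              /* bytes emitted so far */
	size_t size;             /* bytes allocated */
} ToyBuffer;

/* Maximum encoded length per opcode, as a machine description would give. */
static const size_t toy_max_len [TOY_OP_COUNT] = { 1, 1, 2 };

/* Enlarge the buffer if the next instruction might not fit. */
static void
toy_ensure_space (ToyBuffer *buf, size_t needed)
{
	size_t new_size = buf->size ? buf->size : 16;
	uint8_t *p;

	if (buf->len + needed <= buf->size)
		return;
	while (buf->len + needed > new_size)
		new_size *= 2;
	p = realloc (buf->code, new_size);
	assert (p != NULL);
	buf->code = p;
	buf->size = new_size;
}

/* Walk the instructions of one basic block and emit code for each. */
void
toy_output_basic_block (ToyBuffer *buf, const ToyIns *first)
{
	const ToyIns *ins;

	for (ins = first; ins != NULL; ins = ins->next) {
		assert (ins->opcode >= 0 && ins->opcode < TOY_OP_COUNT);
		toy_ensure_space (buf, toy_max_len [ins->opcode]);
		switch (ins->opcode) {
		case TOY_OP_NOP:
			buf->code [buf->len++] = 0x90;   /* nop */
			break;
		case TOY_OP_RET:
			buf->code [buf->len++] = 0xC3;   /* ret */
			break;
		case TOY_OP_MOVE:
			/* mov r/m32, r32: opcode 0x89, ModRM with mod=11 */
			buf->code [buf->len++] = 0x89;
			buf->code [buf->len++] =
				(uint8_t)(0xC0 | (ins->sreg << 3) | ins->dreg);
			break;
		default:
			assert (0 && "unknown opcode");
		}
	}
}
```

- Checking the space against the maximum length before each case
- keeps the individual cases free of bounds checks, which is the
- reason the per-instruction length is worth recording.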
-
- To complete the code emission for a method, a few other
- functions need implementing as well:
-
- mono_arch_emit_prolog ()
- mono_arch_emit_epilog ()
- mono_arch_patch_code ()
-
- mono_arch_emit_prolog () will emit the code to set up the stack
- frame for a method, optionally call the callbacks used in
- profiling and tracing, and move the arguments to their home
- location (in a callee-saved register if the variable was
- allocated to one, or in a stack location if the argument was
- passed in a volatile register and wasn't allocated a
- non-volatile one). Callee-saved registers used by the function
- are saved in the prolog as well.
-
- mono_arch_emit_epilog () will emit the code needed to return
- from the function, optionally calling the profiling or tracing
- callbacks. At this point the basic blocks or the code that was
- moved out of the normal flow for the function can be emitted
- as well (this is usually done to provide better info for the
- static branch predictor). In the epilog, callee-saved
- registers are restored if they were used.
- Note that, to help exception handling and stack unwinding,
- when there is a transition from managed to unmanaged code,
- some special processing needs to be done (basically, saving
- all the registers and setting up the links in the Last Managed
- Frame structure).
-
- When the epilog has been emitted, the upper level code
- arranges for the buffer of memory that contains the native
- code to be copied into an area of executable memory. At this
- point, instructions that use relative addressing need to be
- patched to have the right offsets: this work is done by
- mono_arch_patch_code ().
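- To illustrate why patching has to wait until the final address is
- known, here is a sketch that fixes up the displacement of an x86
- "call rel32" instruction. The function name and its parameters
- are hypothetical; the real mono_arch_patch_code () works from the
- JIT's own patch information, which is not modelled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Patch the 32 bit displacement of an x86 "call rel32" (opcode 0xE8).
 * buf:         the bytes of the method
 * code_base:   the address the code will execute from
 * call_offset: offset of the 0xE8 byte inside buf
 * target:      the address the call must reach
 */
void
toy_patch_call (uint8_t *buf, uintptr_t code_base, size_t call_offset,
		uintptr_t target)
{
	uintptr_t next_ip;
	uint32_t disp;
	int i;

	assert (buf [call_offset] == 0xE8);
	/* The displacement is relative to the end of the 5 byte instruction. */
	next_ip = code_base + call_offset + 5;
	disp = (uint32_t)(target - next_ip);
	/* x86 stores the displacement in little-endian order. */
	for (i = 0; i < 4; i++)
		buf [call_offset + 1 + i] = (uint8_t)(disp >> (8 * i));
}
```

- The same bytes moved to a different code_base need a different
- displacement, so the patch must use the address of the executable
- copy and not the address of the temporary buffer.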
- * Call conventions and register allocation
- To account for the differences in the call conventions, a few functions need to
- be implemented.
-
- mono_arch_allocate_vars () assigns to both arguments and local
- variables the offset relative to the frame register where they
- are stored; dead variables are simply discarded. The total
- amount of stack needed is calculated as well.
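- A minimal sketch of this kind of offset assignment follows. It
- assumes a stack that grows downward with locals at negative
- offsets from the frame register; the ToyVar type and
- toy_allocate_vars () are hypothetical, and the real function also
- has to place arguments according to the call convention.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
	size_t size;     /* size in bytes */
	size_t align;    /* required alignment, a power of two */
	int is_dead;     /* nonzero if the variable is never used */
	long offset;     /* result: offset from the frame register */
} ToyVar;

/* Round value up to the next multiple of align (a power of two). */
static size_t
toy_align_up (size_t value, size_t align)
{
	return (value + align - 1) & ~(align - 1);
}

/*
 * Assign a frame-relative offset to every live variable and return
 * the total amount of stack needed, rounded up to frame_align.
 */
size_t
toy_allocate_vars (ToyVar *vars, size_t count, size_t frame_align)
{
	size_t used = 0;
	size_t i;

	for (i = 0; i < count; i++) {
		if (vars [i].is_dead) {
			vars [i].offset = 0;   /* discarded: gets no slot */
			continue;
		}
		used = toy_align_up (used + vars [i].size, vars [i].align);
		vars [i].offset = -(long)used;
	}
	return toy_align_up (used, frame_align);
}
```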
-
- mono_arch_call_opcode () is the function that most closely
- deals with the call convention on a given system. For each
- argument to a function call, an instruction is created that
- actually puts the argument where needed, be it the stack or a
- specific register. This function can also re-arrange the order
- of evaluation when multiple arguments are involved, if needed
- (for example, on x86 arguments are pushed on the stack in
- reverse order). The function needs to carefully take into
- account platform specific issues, like how structures are
- returned as well as the differences in size and/or alignment
- of managed and corresponding unmanaged structures.
-
- The other chunk of code that needs to deal with the call
- convention and other specifics of a CPU is the local register
- allocator, implemented in a function named
- mono_arch_local_regalloc (). The local allocator deals with one
- basic block at a time and basically just allocates registers
- for temporary values during expression evaluation, spilling
- and unspilling as necessary.
- The local allocator needs to take into account clobbering
- information, both during simple instructions and during
- function calls, and it needs to deal with other
- architecture-specific weirdnesses, like instructions that take
- inputs only in specific registers or output only in some.
- The register allocator does a first pass on the instructions
- in a block, collecting liveness information, and in a backward
- pass on the same list performs the actual register allocation,
- inserting the instructions needed to spill values, if
- necessary.
- Most of the local register allocator was originally planned
- to move to a common file so that the code could be shared
- among similar, risc-like CPUs. The cross-platform local
- register allocator is now implemented and it is documented in
- the jit-regalloc file.
-
- When this part of code is implemented, some testing can be
- done with the generated code for the new architecture. Most
- helpful is the use of the --regression command line switch to
- run the regression tests (basic.cs, for example).
- Note that the JIT will try to initialize the runtime, but it
- may not be able yet to compile and execute complex code:
- commenting out most of the code in the mini_init() function in
- mini.c is needed to let the JIT just compile the regression
- tests. Also, using multiple -v switches on the command line
- makes the JIT dump an increasing amount of information during
- compilation.
- Values loaded into registers need to be extended as needed by
- the ECMA specs:
- *) integers smaller than 4 bytes are extended to int32 values
- *) 32 bit floats are extended to double precision (in particular
- this means that currently all the floating point operations operate
- on doubles)
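- The two extension rules above can be sketched in C as a set of
- load helpers. The toy_load_* names are hypothetical; a port would
- normally pick a sign- or zero-extending load instruction instead
- of calling functions like these.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Signed 1 byte load, sign-extended to int32. */
int32_t
toy_load_i1 (const void *addr)
{
	int8_t v;
	memcpy (&v, addr, sizeof v);
	return v;
}

/* Unsigned 1 byte load, zero-extended to int32. */
int32_t
toy_load_u1 (const void *addr)
{
	uint8_t v;
	memcpy (&v, addr, sizeof v);
	return v;
}

/* Signed 2 byte load, sign-extended to int32. */
int32_t
toy_load_i2 (const void *addr)
{
	int16_t v;
	memcpy (&v, addr, sizeof v);
	return v;
}

/* Unsigned 2 byte load, zero-extended to int32. */
int32_t
toy_load_u2 (const void *addr)
{
	uint16_t v;
	memcpy (&v, addr, sizeof v);
	return v;
}

/* 32 bit float load, extended to double precision. */
double
toy_load_r4 (const void *addr)
{
	float v;
	memcpy (&v, addr, sizeof v);
	return v;
}

/* The same bit pattern gives different values depending on the load. */
void
toy_load_example (void)
{
	uint8_t byte = 0xFF;

	assert (toy_load_i1 (&byte) == -1);
	assert (toy_load_u1 (&byte) == 255);
}
```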
-
- * Method trampolines
- To get better startup performance, the JIT actually compiles a
- method only when needed. To achieve this, when a call to a
- method is compiled, we actually emit a call to a magic
- trampoline. The magic trampoline is a function written in
- assembly that invokes the compiler to compile the given method
- and jumps to the newly compiled code, ensuring the arguments
- it received are passed correctly to the actual method.
- Before jumping to the new code, though, the magic trampoline
- takes care of patching the call site so that next time the
- call will go directly to the method instead of the
- trampoline. How does this all work?
- mono_arch_create_jit_trampoline () creates a small function
- that just preserves the arguments passed to it and adds an
- additional argument (the method to compile) before calling the
- generic trampoline. This small function is called the specific
- trampoline, because it is method-specific (the method to
- compile is hard-coded in the instruction stream).
- The generic trampoline saves all the arguments that could get
- clobbered and calls a C function that will do two things:
-
- *) actually call the JIT to compile the method
- *) identify the calling code so that it can be patched to call directly
- the actual method
-
- If the 'this' argument to a method is a boxed valuetype that
- is passed to a method that expects just a pointer to the data,
- an additional unboxing trampoline will need to be inserted as
- well.
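- The compile-on-first-call flow can be modelled in plain C with
- function pointers. This is only a simulation under stated
- assumptions: the Toy* types are hypothetical, the "call site" is
- a pointer slot rather than a patched call instruction, and
- "compiling" just hands back a function that already exists. The
- real trampolines are generated machine code.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*ToyCode) (int arg);

typedef struct {
	const char *name;
	ToyCode body;          /* stands in for what the JIT would produce */
	ToyCode compiled;      /* NULL until the method has been compiled */
	int compile_count;
} ToyMethod;

/* Models one call instruction; target == NULL means "calls the trampoline". */
typedef struct {
	ToyMethod *method;     /* what the specific trampoline hard-codes */
	ToyCode target;
} ToyCallSite;

static ToyCode
toy_compile (ToyMethod *method)
{
	method->compile_count++;
	method->compiled = method->body;
	return method->compiled;
}

/* Compile if needed, patch the call site, then run the method. */
static int
toy_magic_trampoline (ToyCallSite *site, int arg)
{
	ToyMethod *method = site->method;
	ToyCode code = method->compiled ? method->compiled : toy_compile (method);

	site->target = code;   /* later calls skip the trampoline */
	return code (arg);     /* forward the original argument */
}

int
toy_call (ToyCallSite *site, int arg)
{
	if (site->target != NULL)
		return site->target (arg);
	return toy_magic_trampoline (site, arg);
}

int
toy_square (int x)
{
	return x * x;
}
```

- Calling toy_call () twice on the same site compiles the method
- once; the second call goes through the patched target directly.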
-
- * Exception handling
- Exception handling is likely the most difficult part of the
- port, as it needs to deal with unwinding (both managed and
- unmanaged code) and calling catch and filter blocks. It also
- needs to deal with signals, because mono takes advantage of
- the MMU in the CPU and of the operating system to handle
- dereferences of the NULL pointer. Some of the functions needed
- to implement the mechanisms are:
-
- mono_arch_get_throw_exception () returns a function that takes
- an exception object and invokes an arch-specific function that
- will enter the exception processing. To do so, all the
- relevant registers need to be saved and passed on.
-
- mono_arch_handle_exception () takes the
- exception thrown and a context that describes the state of the
- CPU at the time the exception was thrown. The function needs
- to implement the exception handling mechanism, so it searches
- for a handler for the exception and, if none is found,
- it follows the unhandled exception path (that can print a
- trace and exit or just abort the current thread). The
- difficulty here is to unwind the stack correctly, by restoring
- the register state at each call site in the call chain,
- calling finally, filter and handler blocks while doing so.
-
- As part of exception handling a couple of internal calls need
- to be implemented as well.
- ves_icall_get_frame_info () returns info about a specific
- frame.
- mono_jit_walk_stack () walks the stack and calls a callback with info for
- each frame found.
- ves_icall_get_trace () returns an array of StackFrame objects.
-
- ** Code generation for filter/finally handlers
- Filter and finally handlers are called from 2 different locations:
-
- 1.) from within the method containing the exception clauses
- 2.) from the stack unwinding code
-
- To make this possible we implement them like subroutines,
- ending with a "return" statement. The subroutine does not save
- the base pointer, because we need access to the local
- variables of the enclosing method. It is possible that
- instructions inside those handlers modify the stack pointer,
- thus we save the stack pointer at the start of the handler,
- and restore it at the end. We have to use a "call" instruction
- to execute such finally handlers.
-
- The MIR code for filter and finally handlers looks like:
-
- OP_START_HANDLER
- ...
- OP_END_FINALLY | OP_ENDFILTER(reg)
-
- OP_START_HANDLER: should save the stack pointer somewhere
- OP_END_FINALLY: restores the stack pointer and returns.
- OP_ENDFILTER (reg): restores the stack pointer and returns the value in "reg".
-
- ** Calling finally/filter handlers
- There is a special opcode to call those handlers; it is called
- OP_CALL_HANDLER. It simply emits a call instruction.
-
- It is a bit more complex to call a handler from outside (in the
- stack unwinding code), because we have to restore the whole
- context of the method first. After that we simply emit a call
- instruction to invoke the handler. It is usually possible to use
- the same code to call filter and finally handlers (see
- arch_get_call_filter).
-
- ** Calling catch handlers
- Catch handlers are always called from the stack unwinding
- code. Unlike finally clauses or filters, catch handlers never
- return. Instead we simply restore the whole context, and
- restart execution at the catch handler.
-
- ** Passing Exception objects to catch handlers and filters.
- We use a local variable to store exception objects. The stack
- unwinding code must store the exception object into this
- variable before calling a catch handler or filter.
-
- * Minor helper methods
- A few minor helper methods are referenced from the arch-independent code.
- Some of them are:
-
- *) mono_arch_cpu_optimizations ()
- This function returns a mask of optimizations that
- should be enabled for the current CPU and a mask of
- optimizations that should be excluded, instead.
-
- *) mono_arch_regname ()
- Returns the name for a numeric register.
-
- *) mono_arch_get_allocatable_int_vars ()
- Returns a list of variables that can be allocated to
- the integer registers in the current architecture.
-
- *) mono_arch_get_global_int_regs ()
- Returns a list of callee-saved registers that can be
- used to allocate variables in the current method.
-
- *) mono_arch_instrument_mem_needs ()
- *) mono_arch_instrument_prolog ()
- *) mono_arch_instrument_epilog ()
- Functions needed to implement the profiling interface.
-
- * Testing the port
- The JIT has a set of regression tests in *.cs files inside the mini directory.
- The usual method of testing a port is by compiling these tests on another machine
- with a working runtime by typing 'make rcheck', then copying TestDriver.dll and
- *.exe to the mini directory. The tests can be run by typing:
- ./mono --regression <exe file name>
- The suggested order for working through these tests is the following:
- - basic.exe
- - basic-long.exe
- - basic-float.exe
- - basic-calls.exe
- - objects.exe
- - arrays.exe
- - exceptions.exe
- - iltests.exe
- - generics.exe
-
- * Writing regression tests
- Regression tests for the JIT should be written for any bug
- found in the JIT in one of the *.cs files in the mini
- directory. Eventually all the operations of the JIT should be
- tested (including the ones that get selected only when some
- specific optimization is enabled).
-
- * Platform specific optimizations
- An example of a platform-specific optimization is the peephole
- optimization: we look at a small window of code at a time and
- we replace one or more instructions with others that perform
- better for the given architecture or CPU.
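- As a sketch of the idea, the following pass runs over a toy
- instruction array and applies two rewrites: a load that directly
- follows a store to the same stack slot becomes a register move,
- and a move from a register to itself is dropped. The ToyInsn
- type and the opcodes are hypothetical and much simpler than the
- JIT's own instruction representation.

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_MOVE, TOY_STORE, TOY_LOAD, TOY_OTHER };

typedef struct {
	int op;
	int reg;     /* destination for MOVE/LOAD, source for STORE */
	int src;     /* source register for MOVE */
	int slot;    /* stack slot for STORE/LOAD */
} ToyInsn;

/* Rewrites code in place and returns the new instruction count. */
size_t
toy_peephole (ToyInsn *code, size_t count)
{
	size_t out = 0;
	size_t i;

	assert (code != NULL || count == 0);
	for (i = 0; i < count; i++) {
		ToyInsn ins = code [i];

		/* store [slot] <- r1; load r2 <- [slot]  =>  move r2 <- r1 */
		if (ins.op == TOY_LOAD && out > 0 &&
		    code [out - 1].op == TOY_STORE &&
		    code [out - 1].slot == ins.slot) {
			ins.op = TOY_MOVE;
			ins.src = code [out - 1].reg;
			ins.slot = 0;
		}
		/* move r <- r does nothing, so drop it */
		if (ins.op == TOY_MOVE && ins.reg == ins.src)
			continue;
		code [out++] = ins;
	}
	return out;
}
```

- The window here is only two instructions wide, which keeps the
- pass linear in the size of the block.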
-
- * 64 bit support tips, by Zoltan Varga ([email protected])
- For a 64-bit port of the Mono runtime, you will typically need
- to do the following:
- * use inssel-long.brg instead of
- inssel-long32.brg.
- * implement lots of new opcodes:
- OP_I<OP> is 32 bit op
- OP_L<OP> and CEE_<OP> are 64 bit ops
- The 64 bit version of an existing port might share the code
- with the 32 bit port (for example SPARC/SPARCV9), or it might
- be separate (x86/AMD64).
- That will depend on the similarities of the two instruction
- sets, ABIs, etc.
- The runtime and most parts of the JIT are 64 bit clean
- at this point, so the only parts which require changing are
- the arch dependent files.
-
|