- A new JIT compiler for the Mono Project
- Miguel de Icaza (miguel@{ximian.com,gnome.org}),
- Paolo Molaro (lupus@{ximian.com,debian.org})
-
- * Abstract
- Mini is a new compilation engine for the Mono runtime. The
- new engine is designed to bring new code generation
- optimizations, portability and pre-compilation.
- In this document we describe the design decisions and the
- architecture of the new compilation engine.
- * Introduction
- Mono is an Open Source implementation of the .NET Framework: it
- is made up of a runtime engine that implements the ECMA Common
- Language Infrastructure (CLI), a set of compilers that target
- the CLI and a large collection of class libraries.
- This article discusses the new code generation facilities that
- have been added to the Mono runtime.
- First we discuss the overall architecture of the Mono runtime,
- and how code generation fits into it; then we discuss the
- development and basic architecture of our first JIT compiler
- for the ECMA CIL framework. The next section covers the
- objectives of the work on the new JIT compiler, then we
- discuss the new features available in it, and finally we
- give a technical description of the new code generation
- engine.
- * Architecture of the Mono Runtime
- The Mono runtime is an implementation of the ECMA Common
- Language Infrastructure (CLI), whose aim is to be a common
- platform for executing code in multiple languages.
- Languages that target the CLI generate images that contain
- code in a high-level intermediate representation called the
- "Common Intermediate Language". This intermediate language is
- rich enough to allow for programs and pre-compiled libraries
- to be reflected. The runtime provides an object-oriented
- execution environment with single inheritance and multiple
- interface implementations.
- This runtime provides a number of services for programs that
- are targeted to it: Just-in-Time compilation of CIL code into
- native code, garbage collection, thread management, I/O
- routines, single, double and decimal floating point,
- asynchronous method invocation, application domains, and a
- framework for building arbitrary RPC systems (remoting) and
- integration with system libraries through the Platform Invoke
- functionality.
- The focus of this document is on the services provided by the
- Mono runtime to transform CIL bytecodes into code that is
- native to the underlying architecture.
- The code generation interface is a set of macros that allow a
- C programmer to generate code on the fly; the macros live in
- the mono/jit/arch/ directory and are used by the JIT compiler
- to generate native code.
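- As an illustration, here is a minimal sketch of how such macros
- are used (this assumes the x86 macro set from x86-codegen.h; in
- practice the buffer must live in executable memory):
-
-     #include <glib.h>
-     #include "x86/x86-codegen.h"
-
-     typedef int (*IntFunc) (void);
-
-     /* Emit a tiny function that returns 42. The macros write
-      * machine code into the buffer and advance the code pointer. */
-     static IntFunc
-     emit_return_42 (guint8 *buf)
-     {
-             guint8 *code = buf;
-             x86_mov_reg_imm (code, X86_EAX, 42);    /* mov $42, %eax */
-             x86_ret (code);                         /* ret */
-             return (IntFunc) buf;
-     }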
- The platform invocation code is interesting, as it generates
- CIL code on the fly to marshal parameters; this code is in
- turn processed by the JIT engine.
- * Previous Experiences
- Mono first built a JIT engine, which has been used to
- bootstrap Mono since January 2002. This JIT engine has
- reasonable performance and uses a tree-pattern-matching
- instruction selector based on BURS technology. It was
- designed by Dietmar Maurer, Paolo Molaro and Miguel de Icaza.
- The existing JIT compiler has three phases:
- * Re-creation of the semantic tree from CIL
- byte-codes.
- * Instruction selection, with a cost-driven
- engine.
- * Code generation and register allocation.
- It is also hooked into the rest of the runtime to provide
- services like marshaling, just-in-time compilation and
- invocation of "internal calls".
- This engine constructed a collection of trees, which we
- referred to as the "forest of trees"; this forest was created
- by "hydrating" the CIL instruction stream.
- The first step was to identify the basic blocks in the method
- and to compute its control flow graph (cfg). Once this
- information was computed, a stack analysis on each basic block
- was performed to create a forest of trees for each one of
- them.
- So for example, the following statement:
- int a, b;
- ...
- b = a + 1;
- Which would be represented in CIL as:
- ldloc.0
- ldc.i4.1
- add
- stloc.1
- The stack analysis would then create the following tree:
- (STIND_I4 ADDR_L[EBX|2] (
- ADD (LDIND_I4 ADDR_L[ESI|1])
- CONST_I4[1]))
- This tree contains information from the stack analysis: for
- instance, notice that the operations explicitly encode the
- data types they operate on; there is no longer any
- ambiguity about the types, because this information has been
- inferred.
- At this point the JIT would pass the constructed forest of
- trees to the architecture-dependent JIT compiler.
- The architecture-dependent code then performed register
- allocation (optionally using linear scan allocation for
- variables, based on liveness analysis).
- Once variables had been assigned, tree pattern matching with
- dynamic programming was used (the tree pattern matcher is
- custom-built for each architecture, using a code
- generator: monoburg). The instruction selector used cost
- functions to select the best instruction patterns.
- The instruction selector was able to produce instructions
- that take advantage of the x86 indexed addressing modes, for
- example.
- One problem, though, was that the code emitter and the register
- allocator did not have any visibility outside the current
- tree, which meant that some redundant instructions were
- generated. A peephole optimizer was hard to write for this
- architecture, given the tree-based representation that was
- used.
- This JIT was functional, but it did not provide a good
- architecture to base future optimizations on. Also, the
- line between architecture-neutral and architecture-specific
- code and optimizations was hard to draw.
- The JIT engine supported two code generation modes for
- applications that host multiple application domains: generate
- code that will be shared across application domains, or
- generate code that will not be shared across them.
- * Objectives of the new JIT engine.
- We wanted to support a number of features that were missing:
- * Ahead-of-time compilation.
- The idea is to allow developers to pre-compile their code
- to native code to reduce startup time and the working
- set used at runtime by the just-in-time compiler.
- Although this has not been a visible problem in Mono, we
- wanted to address it proactively.
- When an assembly (a Mono/.NET executable) is installed in
- the system, it would then be possible to pre-compile the
- code, and have the JIT compiler tune the generated code
- to the particular CPU on which the software is
- installed.
- This is done in the Microsoft .NET world with a tool
- called ngen.exe.
- * Have a good platform for doing code optimizations.
- The design called for a good architecture that would
- enable various levels of optimizations: some
- optimizations are better performed on high-level
- intermediate representations, some on medium-level and
- some at low-level representations.
- It should also be possible to turn these on
- or off conditionally. Some optimizations are too expensive to be used
- in just-in-time compilation scenarios, but these
- expensive optimizations can be turned on for
- ahead-of-time compilations or when using profile-guided
- optimizations on a subset of the executed methods.
- * Reduce the effort required to port the Mono code
- generator to new architectures.
- For Mono to gain wide adoption in the Unix world, it is
- necessary that the JIT engine works on most of today's
- commercial hardware platforms.
- * Features of the new JIT engine.
- The new JIT engine was architected by Dietmar Maurer and Paolo
- Molaro, based on the new objectives.
- Mono provides a number of services to applications running
- with the new JIT compiler:
- * Just-in-Time compilation of CLI code into native code.
- * Ahead-of-Time compilation of CLI code, to reduce
- startup time of applications.
- A number of software development features are also available:
- * Execution time profiling (--profile)
- Generates a report of the time consumed by each routine,
- the number of times it was invoked, and its
- callers.
- * Memory usage profiling (--profile)
- Generates a report of the memory usage by a program
- that is run under the Mono JIT.
- * Code coverage (--coverage)
- * Execution tracing.
- People who are interested in developing and improving the Mini
- JIT compiler will also find a few useful routines:
- * Compilation times
- This measures how long the JIT takes
- to compile a routine.
- * Control Flow Graph and Dominator Tree drawing.
- These are visual aids for the JIT developer: they
- render representations of the Control Flow graph, and
- for the more advanced optimizations, they draw the
- dominator tree graph.
- This requires Dot (from the graphviz package) and Ghostview.
- * Code generator regression tests.
- The engine contains support for running regression
- tests on the virtual machine, which is very helpful to
- developers interested in improving the engine.
- * Optimization benchmark framework.
- The JIT engine can generate graphs that compare
- various benchmarks embedded in an assembly, running the
- various tests with different optimization flags.
- This requires Perl and GD::Graph.
- * Flexibility
- This is probably the most important component of the new code
- generation engine. The internals are relatively easy to
- replace and update, even large passes can be replaced and
- implemented differently.
- * New code generator
- Compiling a method begins with the `mini_method_to_ir' routine
- that converts the CIL representation into a medium-level
- intermediate representation.
- The mini_method_to_ir routine performs a number of operations:
- * Flow analysis and control flow graph computation.
- Unlike the previous version, stack analysis and the control
- flow graph are computed in a single pass in the
- mini_method_to_ir function. This is done for performance
- reasons: although the complexity increases, the benefit
- for a JIT compiler is that more time remains available
- for performing other optimizations.
- * Basic block computation.
- mini_method_to_ir populates the MonoCompile structure
- with an array of basic blocks, each of which contains a
- forest of trees made up of MonoInst structures.
- * Inlining
- Inlining is no longer restricted to methods containing
- a single basic block; instead, it is possible to inline
- arbitrarily complex methods.
- The heuristics to choose what to inline are likely going
- to be tuned in the future.
- * Method to opcode conversion.
- Some method invocations, like `call Math.Sin', are
- transformed into an opcode: this turns the call
- into a semantically rich node, which is later inlined
- into an FPU instruction (see the sketch after this list).
- Various Array method invocations are turned into
- opcodes as well (the Get, Set and Address methods).
- * Tail recursion elimination
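- As a sketch of the method-to-opcode conversion above (an
- illustration with hypothetical names, not the actual mini.c
- code), the CIL decoding loop can map well-known methods to
- rich opcodes like this:
-
-     #include <string.h>
-
-     enum { OP_NONE, OP_SIN, OP_COS, OP_SQRT };
-
-     /* Return the rich opcode for a well-known method, or OP_NONE
-      * if the call should go through the normal call path. */
-     static int
-     intrinsic_opcode (const char *klass, const char *method)
-     {
-             if (strcmp (klass, "System.Math") != 0)
-                     return OP_NONE;
-             if (strcmp (method, "Sin") == 0)
-                     return OP_SIN;   /* later emitted as the x86 fsin */
-             if (strcmp (method, "Cos") == 0)
-                     return OP_COS;   /* fcos */
-             if (strcmp (method, "Sqrt") == 0)
-                     return OP_SQRT;  /* fsqrt */
-             return OP_NONE;
-     }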
- ** Basic blocks
- The MonoInst structure holds the actual decoded instruction,
- with the semantic information from the stack analysis.
- MonoInst is interesting because initially it is part of a tree
- structure. Here is the same example tree as produced by the
- new JIT engine:
- (stind.i4 regoffset[0xffffffd4(%ebp)]
- (add (ldind.i4 regoffset[0xffffffd8(%ebp)])
- iconst[1]))
- This is a medium-level intermediate representation (MIR).
- Some complex opcodes are decomposed at this stage into a
- collection of simpler opcodes. Not every complex opcode is
- decomposed at this stage, as we need to preserve the semantic
- information during various optimization phases.
- For example, a NEWARR opcode carries the length and the type of
- the array, which can be used later to avoid type checking or
- array bounds checks.
- There are a number of operations supported on this
- representation:
- * Branch optimizations.
- * Variable liveness.
- * Loop optimizations: the dominator trees are
- computed, loops are detected, and their nesting
- level computed.
- * Conversion of the method into static single assignment
- form (SSA form).
- * Dead code elimination.
- * Constant propagation.
- * Copy propagation.
- * Constant folding.
- Once the above optimizations are optionally performed, a
- decomposition phase is used to turn some complex opcodes into
- internal method calls. In the initial version of the JIT
- engine, various operations on longs are emulated instead of
- being inlined. Also, the newarr invocation is turned into a
- call to the runtime.
- At this point, after computing variable liveness, it is
- possible to use the linear scan algorithm for allocating
- variables to registers. The linear scan pass uses the
- information that was previously gathered by the loop nesting
- and loop structure computation to favor variables in inner
- loops. This process updates the basic block `nesting' field
- which is later used during liveness analysis.
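- In outline, linear scan allocation works as follows (a textbook
- sketch, not Mono's exact implementation; the spill heuristic is
- simplified to always spill the current interval):
-
-     #define NREGS 3   /* e.g. the x86 callee-saved ESI, EDI, EBX */
-
-     typedef struct {
-             int start, end;  /* live range of the variable */
-             int reg;         /* assigned register, or -1 if spilled */
-     } Interval;
-
-     /* `v' must be sorted by ascending start position. */
-     static void
-     linear_scan (Interval *v, int n)
-     {
-             Interval *active [NREGS];
-             int nactive = 0;
-             int free_regs = (1 << NREGS) - 1;  /* bitmask of free regs */
-
-             for (int i = 0; i < n; ++i) {
-                     /* expire intervals that end before this one starts */
-                     int j = 0;
-                     for (int k = 0; k < nactive; ++k) {
-                             if (active [k]->end >= v [i].start)
-                                     active [j++] = active [k];
-                             else
-                                     free_regs |= 1 << active [k]->reg;
-                     }
-                     nactive = j;
-
-                     if (free_regs) {
-                             int reg = 0;
-                             while (!(free_regs & (1 << reg)))
-                                     reg++;
-                             free_regs &= ~(1 << reg);
-                             v [i].reg = reg;
-                             active [nactive++] = &v [i];
-                     } else {
-                             /* a production allocator would spill the
-                              * active interval with the furthest end */
-                             v [i].reg = -1;
-                     }
-             }
-     }
-
- In Mono, the loop nesting information gathered earlier biases
- this choice toward variables used in inner loops.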
- Stack space is then reserved for the local variables and any
- temporary variables generated during the various
- optimizations.
- ** Instruction selection
- At this point, the BURS instruction selector is invoked to
- transform the tree-based representation into a list of
- instructions. This is done using a tree pattern matcher that
- is generated for the architecture using the `monoburg' tool.
- Monoburg takes as input a file that describes tree patterns,
- which are matched against the trees that were produced by the
- engine in the previous stages.
- The pattern matching might yield more than one match for a
- particular tree. In this case, the match selected is the one
- with the smallest cost. A cost can be attached to each
- rule; if no cost is provided, the implicit cost is one.
- The cost function can be used to select particular blocks of
- code for a given architecture, or, by using a prohibitively
- high number, to prevent a rule from matching.
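- For instance, a hypothetical pair of rules (assuming the
- monoburg convention of writing the cost after the pattern)
- could make a constant-operand addition prefer a cheaper
- encoding:
-
-     reg: OP_ADD (reg, reg) 2 {
-             /* generic two-register add */
-     }
-     reg: OP_ADD (reg, OP_ICONST) 1 {
-             /* cheaper: fold the constant into the instruction */
-     }
-
- With these costs, the second rule wins whenever the right
- operand is a constant.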
- The various rules that our JIT engine uses transform a tree of
- MonoInsts into a list of MonoInsts:
- +-----------------------------------------------------------+
- | Tree List |
- | of ===> Instruction selection ===> of |
- | MonoInst MonoInst. |
- +-----------------------------------------------------------+
- During this process, various kinds of MonoInst
- disappear, turned into lower-level representations. The
- JIT compiler just happens to reuse the same structure (this is
- done to reduce memory usage and improve memory locality).
- The instruction selection rules are split in a number of
- files, each one with a particular purpose:
- inssel.brg
- Contains the generic instruction selection
- patterns.
- inssel-x86.brg
- Contains x86 specific rules.
- inssel-ppc.brg
- Contains PowerPC specific rules.
- inssel-long32.brg
- BURG file for 64-bit instructions on 32-bit architectures.
- inssel-long.brg
- BURG file for 64-bit architectures.
- inssel-float.brg
- BURG file for floating point instructions.
-
- For a given build, a set of those files would be included.
- For example, for the build of Mono on the x86, the following
- set is used:
- inssel.brg inssel-x86.brg inssel-long32.brg inssel-float.brg
- ** Native method generation
- The native method generation has a number of steps:
- * Architecture specific register allocation.
- The information about loop nesting that was
- previously gathered is used here to hint the
- register allocator.
- * Generating the method prolog/epilog.
- * Optionally generate code to introduce tracing facilities.
- * Hooking into the debugger.
- * Performing any pending fixups.
- * Code generation.
- *** Code Generation
- The actual code generation is contained in the architecture
- specific portion of the compiler. The input to the code
- generator is each one of the basic blocks with its list of
- instructions that were produced in the instruction selection
- phase.
- During the instruction selection phase, virtual registers are
- assigned. Just before the peephole optimization is performed,
- physical registers are assigned.
- A simple peephole and algebraic optimizer is run at this
- stage.
- The peephole optimizer removes some redundant operations at
- this stage. This is possible because the code generator now
- has visibility into the whole basic block that spans the
- original trees.
- The algebraic optimizer performs some simple algebraic
- optimizations that replace expensive operations with cheaper
- operations if possible.
- The rest of the code generation is fairly simple: a switch
- statement is used to generate code for each of the MonoInsts.
- In the mono/mini/mini-ARCH.c files, this function is called
- "mono_arch_output_basic_block".
- We always try to allocate code in sequence, instead of just using
- malloc. This way we increase spatial locality, which gives a
- massive speedup on most architectures.
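- A minimal sketch of the idea (a hypothetical helper, not Mono's
- actual code manager; error handling omitted):
-
-     #include <stddef.h>
-     #include <sys/mman.h>
-
-     static unsigned char *arena;
-     static size_t arena_used, arena_size;
-
-     /* Carve method code sequentially out of one large executable
-      * region, so consecutively compiled methods are adjacent. */
-     static void *
-     code_alloc (size_t len)
-     {
-             if (!arena) {
-                     arena_size = 1 << 20;  /* one 1 MB arena */
-                     arena = mmap (NULL, arena_size,
-                                   PROT_READ | PROT_WRITE | PROT_EXEC,
-                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
-             }
-             len = (len + 15) & ~(size_t) 15;  /* 16-byte align methods */
-             if (arena_used + len > arena_size)
-                     return NULL;  /* a real allocator chains a new arena */
-             arena_used += len;
-             return arena + arena_used - len;
-     }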
- *** Ahead of Time compilation
- Ahead-of-Time compilation is a new feature of our new
- compilation engine. The compilation engine is shared by the
- Just-in-Time (JIT) compiler and the Ahead-of-Time compiler
- (AOT).
- The difference is in the set of optimizations that are turned
- on for each mode: Just-in-Time compilation should be as fast
- as possible, while Ahead-of-Time compilation can take as long
- as required, because it is not performed under time-critical
- constraints.
- With AOT compilation, we can afford to turn all of the
- computationally expensive optimizations on.
- After the code generation phase is done, the code and any
- required fixup information are saved into a file that is
- readable by "as" (the native assembler available on all
- systems). This assembly file is then passed to the native
- assembler, which generates a loadable module.
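- In practice, pre-compiling and then running an assembly looks
- roughly like this (the --aot flag as exposed by the driver of
- this era; the exact invocation and output file names may differ
- between versions):
-
-     mono --aot program.exe   # compile, assemble into a shared module
-     mono program.exe         # the pre-compiled image is used if found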
- At execution time, when an assembly is loaded from the disk,
- the runtime engine will probe for the existence of a
- pre-compiled image. If the pre-compiled image exists, then it
- is loaded, and the method invocations are resolved to the code
- contained in the loaded module.
- The code generated in the AOT scenario is slightly
- different from the JIT scenario: it is
- application-domain relative and can be shared among
- multiple threads.
- This is the same code generation that is used when the runtime
- is instructed to maximize code sharing on a multi-application
- domain scenario.
- * SSA-based optimizations
- SSA form simplifies many optimizations because each variable
- has exactly one definition site. This means that each
- variable is only initialized once.
- For example, code like this:
- a = 1
- ..
- a = 2
- call (a)
- is internally turned into:
- a1 = 1
- ..
- a2 = 2
- call (a2)
- In the presence of branches, like:
- if (x)
- a = 1
- else
- a = 2
- call (a)
- The code is turned into:
- if (x)
- a1 = 1;
- else
- a2 = 2;
- a3 = phi (a1, a2)
- call (a3)
- All uses of a variable are "dominated" by its definition.
- This representation is useful as it simplifies the
- implementation of a number of optimizations like conditional
- constant propagation, array bounds check removal and dead code
- elimination.
- * Register allocation.
- Global register allocation is performed on the medium
- intermediate representation just before instruction selection
- is performed on the method. Local register allocation is
- later performed at the basic-block level on the list of
- instructions produced by instruction selection.
- Global register allocation uses the following input:
- 1) the set of register-sized variables that can be allocated to
- a register (this is an architecture-specific setting; for x86
- these are the callee-saved registers ESI, EDI and
- EBX).
- 2) liveness information for the variables
- 3) (optionally) loop info to favor variables that are used in
- inner loops.
- During the instruction selection phase, symbolic registers are
- assigned to temporary values in expressions.
- Local register allocation assigns hard registers to the
- symbolic registers; it is performed at the basic-block level,
- just before the code is actually emitted.
- A CPU description file describes the input registers, output
- registers, fixed registers and registers clobbered by each
- operation.
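- For example, a hypothetical cpu-pentium.md entry for an integer
- addition (the exact entry in the shipped file may differ) could
- read:
-
-     add: dest:i src1:i src2:i clob:1 len:2
-
- meaning that the result and both operands live in integer
- registers, that the first source is overwritten by the result
- (the x86 two-address form), and that the encoding takes at most
- 2 bytes of native code.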
- * BURG Code Generator Generator
- monoburg was written by Dietmar Maurer. It is based on the
- papers "BURG - Fast Optimal Instruction Selection and
- Tree Parsing" (Christopher W. Fraser, Robert R. Henry and
- Todd A. Proebsting) and "Engineering a Simple, Efficient
- Code-Generator Generator" (Fraser, David R. Hanson and
- Proebsting).
- The original BURG implementation is unable to work on DAGs;
- only trees are allowed. Our monoburg implementation is able to
- generate a tree matcher that works on DAGs, and we use this
- feature in the new JIT. This simplifies the code, because we
- can pass DAGs directly and don't need to convert them to trees.
- * Adding IL opcodes: an exercise (from a post by Paolo Molaro)
- mini.c is the file that reads the IL code stream and decides
- how each IL instruction is implemented
- (the mono_method_to_ir () function), so you always have to add
- an entry to the big switch inside that function: there are
- plenty of examples in the file.
- An IL opcode can be implemented in a number of ways, depending
- on what it does and how it needs to do it.
-
- Some opcodes are implemented using a helper function: one of
- the simpler examples is the CEE_STELEM_REF implementation.
- In this case the opcode implementation is written in a C
- function. You will need to register the function with the JIT
- before you can use it (mono_register_jit_call), and you need to
- emit the call to the helper using the mono_emit_jit_icall()
- function.
- This is the simplest way to add a new opcode and it doesn't
- require any arch-specific change (though it's limited to what
- you can do in C code and the performance may be limited by the
- function call).
-
- Other opcodes can be implemented with one or more of the already
- implemented low-level instructions.
- An example is the OP_STRLEN opcode which implements
- String.Length using a simple load from memory. In this case
- you need to add a rule to the appropriate burg file,
- describing the arguments of the opcode and, if any,
- its 'return' value.
- The OP_STRLEN case is:
-
- reg: OP_STRLEN (reg) {
- MONO_EMIT_LOAD_MEMBASE_OP (s, tree, OP_LOADI4_MEMBASE, state->reg1,
- state->left->reg1, G_STRUCT_OFFSET (MonoString, length));
- }
- The above means: OP_STRLEN takes a register as an argument
- and returns its value in a register. The implementation
- is given inside the braces.
-
- The opcode returns a value in an integer register
- (state->reg1) by performing an int32 load of the length field
- of the MonoString represented by the input register
- (state->left->reg1). Before the burg rules are applied, the
- internal representation is based on trees, so you get the
- left/right pointers (state->left and state->right,
- respectively); the result is stored in state->reg1.
- This instruction implementation doesn't require arch-specific
- changes (it uses MONO_EMIT_LOAD_MEMBASE_OP, which is
- available on all platforms), and usually the produced code is
- fast.
-
- Next we have opcodes that must be implemented with new low-level
- architecture specific instructions (either because of performance
- considerations or because the functionality can't be implemented
- in other ways).
- You need a burg rule in this case, too. For example,
- consider the OP_CHECK_THIS opcode (used to raise an exception
- if the `this' pointer is null). The burg rule simply reads:
-
- stmt: OP_CHECK_THIS (reg) {
- mono_bblock_add_inst (s->cbb, tree);
- }
-
- Note that this opcode does not return a value (hence the
- "stmt") and it takes a register as input.
- mono_bblock_add_inst (s->cbb, tree) just adds the instruction
- (the tree variable) to the current basic block (s->cbb). In
- mini this is the place where the internal representation
- switches from the tree format to the low-level format (the
- list of simple instructions).
- In this case the actual opcode implementation is delegated to
- the arch-specific code. A low-level opcode needs an entry in
- the machine description (the *.md files in mini/). This entry
- describes what kinds of registers, if any, are used by the
- instruction, as well as other details such as constraints or
- other hints to the low-level engine, which are architecture
- specific.
- cpu-pentium.md, for example, has the following entry:
-
- checkthis: src1:b len:3
-
- This means the instruction uses an integer register as a base
- pointer (basically a load or store is done on it) and it takes
- 3 bytes of native code to implement it.
- Now you just need to provide the low-level implementation for
- the opcode in one of the mini-$arch.c files, in the
- mono_arch_output_basic_block() function. There is a big switch
- here too. The x86 implementation is:
- case OP_CHECK_THIS:
- /* ensure ins->sreg1 is not NULL */
- x86_alu_membase_imm (code, X86_CMP, ins->sreg1, 0, 0);
- break;
-
- If the $arch-codegen.h header file doesn't have the code to
- emit the low-level native code, you'll need to write that as
- well.
- Complex opcodes with register constraints may require other
- changes to the local register allocator, but usually they are
- not needed.
-
- * Future
- Profile-based optimization is something that we are very
- interested in supporting. There are two possible usage
- scenarios:
- * Based on the profile information gathered during
- the execution of a program, hot methods can be compiled
- with the highest level of optimization, while bootstrap
- code and cold methods can be compiled with the fewest
- optimizations and placed on a discardable list.
- * Code reordering: this profile-based optimization would
- only make sense for pre-compiled code. The profile
- information is used to reorder the assembly code on disk
- so that the code is laid out in a way that
- increases locality.
- This is the same principle under which SGI's cord program
- works.
- The nature of the CIL makes the above optimizations
- easy to implement and deploy. Since we define our own
- universe for these things, no interaction with
- system tools is required, nor are upgrades to the underlying
- infrastructure needed.
- Instruction scheduling is important for certain kinds of
- processors, and some of the framework needed to cope with it
- exists today in our register allocator and instruction
- selector, but it has not been finished. The instruction
- selection would happen at the same time as local register
- allocation.