A new JIT compiler for the Mono Project

Miguel de Icaza (miguel@{ximian.com,gnome.org}),
Paolo Molaro (lupus@{ximian.com,debian.org})

* Abstract

Mini is a new compilation engine for the Mono runtime. The new engine is designed to bring new code generation optimizations, portability and pre-compilation.

In this document we describe the design decisions and the architecture of the new compilation engine.
* Introduction

Mono is an Open Source implementation of the .NET Framework: it is made up of a runtime engine that implements the ECMA Common Language Infrastructure (CLI), a set of compilers that target the CLI and a large collection of class libraries.

This article discusses the new code generation facilities that have been added to the Mono runtime.

First we discuss the overall architecture of the Mono runtime and how code generation fits into it; then we discuss the development and basic architecture of our first JIT compiler for the ECMA CIL framework. The next section covers the objectives for the work on the new JIT compiler; we then discuss the new features available in the new JIT compiler, and finally give a technical description of the new code generation engine.
* Architecture of the Mono Runtime

The Mono runtime is an implementation of the ECMA Common Language Infrastructure (CLI), whose aim is to be a common platform for executing code in multiple languages.

Languages that target the CLI generate images that contain code in a high-level intermediate representation called the "Common Intermediate Language" (CIL). This intermediate language is rich enough to allow programs and pre-compiled libraries to be reflected. The execution model is object oriented, with single inheritance and multiple interface implementations.

The runtime provides a number of services for programs that are targeted to it: Just-in-Time compilation of CIL code into native code, garbage collection, thread management, I/O routines, single, double and decimal floating point, asynchronous method invocation, application domains, a framework for building arbitrary RPC systems (remoting), and integration with system libraries through the Platform Invoke functionality.

The focus of this document is on the services provided by the Mono runtime to transform CIL bytecodes into code that is native to the underlying architecture.

The code generation interface is a set of macros, found in the mono/jit/arch/ directory, that allow a C programmer to generate native code on the fly. These macros are used by the JIT compiler to generate native code.
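As a rough illustration of this style of interface, the following sketch JITs a trivial function that returns a constant, assuming the x86 macros x86_mov_reg_imm and x86_ret from the architecture code generation header (x86-codegen.h); the mmap-based buffer management is only for the example and is not how the runtime manages code memory.

    /* Minimal sketch: emit "mov eax, 42; ret" into an executable buffer
     * with the x86 code generation macros, then call it. Header path and
     * buffer handling are illustrative. */
    #include <stdio.h>
    #include <sys/mman.h>
    #include <mono/arch/x86/x86-codegen.h>   /* assumed location of the macros */

    int
    main (void)
    {
        unsigned char *buf, *code;

        /* A small region that is writable and executable. */
        buf = mmap (NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;

        code = buf;
        x86_mov_reg_imm (code, X86_EAX, 42);   /* return value goes in EAX */
        x86_ret (code);

        /* The buffer now contains "int f (void) { return 42; }". */
        printf ("%d\n", ((int (*) (void)) buf) ());
        return 0;
    }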
The platform invocation code is interesting, as it generates CIL code on the fly to marshal parameters; this code is in turn processed by the JIT engine.
* Previous Experiences

Mono has built a JIT engine, which has been used to bootstrap Mono since January 2002. This JIT engine has reasonable performance and uses a tree-pattern-matching instruction selector based on the BURS technology. This JIT compiler was designed by Dietmar Maurer, Paolo Molaro and Miguel de Icaza.

The existing JIT compiler has three phases:

    * Re-creation of the semantic tree from CIL byte-codes.

    * Instruction selection, with a cost-driven engine.

    * Code generation and register allocation.

It is also hooked into the rest of the runtime to provide services like marshaling, just-in-time compilation and invocation of "internal calls".

This engine constructed a collection of trees, which we referred to as the "forest of trees"; this forest was created by "hydrating" the CIL instruction stream.

The first step was to identify the basic blocks in the method and to compute the control flow graph (cfg) for it. Once this information was computed, a stack analysis on each basic block was performed to create a forest of trees for each one of them.
So, for example, the following statement:

    int a, b;
    ...
    b = a + 1;

would be represented in CIL as:

    ldloc.0
    ldc.i4.1
    add
    stloc.1

The stack analysis would then create the following tree:

    (STIND_I4 ADDR_L[EBX|2] (
        ADD (LDIND_I4 ADDR_L[ESI|1])
             CONST_I4[1]))

This tree contains information from the stack analysis: for instance, notice that the operations explicitly encode the data types they are operating on; there is no longer any ambiguity about the types, because this information has been inferred.
At this point the JIT would pass the constructed forest of trees to the architecture-dependent JIT compiler.

The architecture-dependent code then performed register allocation (optionally using linear scan allocation for variables, based on liveness analysis).

Once variables had been assigned, tree pattern matching with dynamic programming was used (the tree pattern matcher is custom built for each architecture using a code generator: monoburg). The instruction selector used cost functions to select the best instruction patterns.

The instruction selector is able to produce instructions that take advantage of the x86 indexed addressing modes, for example.

One problem, though, is that the code emitter and the register allocator did not have any visibility outside the current tree, which meant that some redundant instructions were generated. A peephole optimizer was hard to write for this architecture, given the tree-based representation that is used.

This JIT was functional, but it did not provide a good architecture to base future optimizations on. Also, the line between architecture-neutral and architecture-specific code and optimizations was hard to draw.

The JIT engine supported two code generation modes to support the two optimization modes for applications that host multiple application domains: generate code that will be shared across application domains, or generate code that will not be shared across application domains.
* Objectives of the new JIT engine

We wanted to support a number of features that were missing:

    * Ahead-of-time compilation.

      The idea is to allow developers to pre-compile their code to native code to reduce startup time and the working set that is used at runtime by the just-in-time compiler.

      Although in Mono this has not been a visible problem, we wanted to address it proactively.

      When an assembly (a Mono/.NET executable) is installed in the system, it would then be possible to pre-compile the code and have the JIT compiler tune the generated code to the particular CPU on which the software is installed.

      This is done in the Microsoft .NET world with a tool called ngen.exe.

    * Have a good platform for doing code optimizations.

      The design called for a good architecture that would enable various levels of optimizations: some optimizations are better performed on high-level intermediate representations, some on medium-level and some on low-level representations.

      It should also be possible to turn these on or off conditionally. Some optimizations are too expensive to be used in just-in-time compilation scenarios, but these expensive optimizations can be turned on for ahead-of-time compilation or when using profile-guided optimizations on a subset of the executed methods.

    * Reduce the effort required to port the Mono code generator to new architectures.

      For Mono to gain wide adoption in the Unix world, it is necessary that the JIT engine works on most of today's commercial hardware platforms.
* Features of the new JIT engine

The new JIT engine was architected by Dietmar Maurer and Paolo Molaro, based on the new objectives.

Mono provides a number of services to applications running with the new JIT compiler:

    * Just-in-Time compilation of CLI code into native code.

    * Ahead-of-Time compilation of CLI code, to reduce startup time of applications.

A number of software development features are also available (see the usage example after this list):

    * Execution time profiling (--profile)

      Generates a report of the time consumed by routines, their invocation counts, and their callers.

    * Memory usage profiling (--profile)

      Generates a report of the memory usage of a program that is run under the Mono JIT.

    * Code coverage (--coverage)

    * Execution tracing.

People who are interested in developing and improving the Mini JIT compiler will also find a few useful facilities:

    * Compilation times

      This is used to measure the time spent by the JIT when compiling a routine.

    * Control Flow Graph and Dominator Tree drawing.

      These are visual aids for the JIT developer: they render representations of the control flow graph and, for the more advanced optimizations, they draw the dominator tree graph.

      This requires Dot (from the graphviz package) and Ghostview.

    * Code generator regression tests.

      The engine contains support for running regression tests on the virtual machine, which is very helpful to developers interested in improving the engine.

    * Optimization benchmark framework.

      The JIT engine will generate graphs that compare various benchmarks embedded in an assembly, and run the various tests with different optimization flags.

      This requires Perl and GD::Graph.
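For example, the profiling and tracing facilities listed above are enabled with command-line flags when launching a program; the flag names are the ones documented above, while the exact report format depends on the Mono version:

    mono --profile program.exe      # execution time and memory profiling
    mono --trace program.exe        # trace every method invocation
    mono --coverage program.exe     # basic code coverage information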
* Flexibility

This is probably the most important component of the new code generation engine. The internals are relatively easy to replace and update; even large passes can be replaced and implemented differently.
* New code generator

Compiling a method begins with the `mini_method_to_ir' routine, which converts the CIL representation into a medium-level intermediate representation.

The mini_method_to_ir routine performs a number of operations:

    * Flow analysis and control flow graph computation.

      Unlike the previous version, stack analysis and control flow graphs are computed in a single pass in the mini_method_to_ir function. This is done for performance reasons: although the complexity increases, the benefit for a JIT compiler is that there is more time available for performing other optimizations.

    * Basic block computation.

      mini_method_to_ir populates the MonoCompile structure with an array of basic blocks, each of which contains a forest of trees made up of MonoInst structures.

    * Inlining.

      Inlining is no longer restricted to methods containing a single basic block; instead it is possible to inline arbitrarily complex methods.

      The heuristics to choose what to inline are likely going to be tuned in the future.

    * Method to opcode conversion.

      Some method invocations like `call Math.Sin' are transformed into an opcode: this turns the call into a semantically rich node, which is later inlined as an FPU instruction.

      Various Array method invocations are turned into opcodes as well (the Get, Set and Address methods).

    * Tail recursion elimination.
** Basic blocks

The MonoInst structure holds the actual decoded instruction, with the semantic information from the stack analysis.

MonoInst is interesting because initially it is part of a tree structure. Here is a sample of the same tree with the new JIT engine:

    (stind.i4 regoffset[0xffffffd4(%ebp)]
        (add (ldind.i4 regoffset[0xffffffd8(%ebp)])
             iconst[1]))

This is a medium-level intermediate representation (MIR).

Some complex opcodes are decomposed at this stage into a collection of simpler opcodes. Not every complex opcode is decomposed at this stage, as we need to preserve the semantic information during various optimization phases.

For example, a NEWARR opcode carries the length and the type of the array, which could be used later to avoid type checks or array bounds checks.
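To make the shape of this representation concrete, here is a simplified C sketch of such a tree of instruction nodes for the `b = a + 1' example above. The struct layout, opcode names and helper are illustrative only; the real MonoInst declared in the mini headers carries much more information (type, registers, liveness data and so on).

    /* Illustrative only: a stripped-down instruction node in the spirit of
     * MonoInst, and the tree built for "b = a + 1". */
    #include <stdlib.h>

    typedef struct SampleInst {
        int opcode;                       /* e.g. OP_ADD, OP_ICONST, ... */
        int data;                         /* constant value or local slot */
        struct SampleInst *left, *right;  /* operand subtrees, NULL if unused */
    } SampleInst;

    enum { OP_ICONST, OP_LDIND_I4, OP_ADD, OP_STIND_I4, OP_LOCAL_ADDR };

    SampleInst *
    new_inst (int opcode, int data, SampleInst *left, SampleInst *right)
    {
        SampleInst *ins = calloc (1, sizeof (SampleInst));
        ins->opcode = opcode;
        ins->data = data;
        ins->left = left;
        ins->right = right;
        return ins;
    }

    /* (stind.i4 local[1] (add (ldind.i4 local[0]) iconst[1])) */
    SampleInst *
    build_sample_tree (void)
    {
        SampleInst *load  = new_inst (OP_LDIND_I4, 0,
                                      new_inst (OP_LOCAL_ADDR, 0, NULL, NULL), NULL);
        SampleInst *one   = new_inst (OP_ICONST, 1, NULL, NULL);
        SampleInst *add   = new_inst (OP_ADD, 0, load, one);
        SampleInst *store = new_inst (OP_STIND_I4, 0,
                                      new_inst (OP_LOCAL_ADDR, 1, NULL, NULL), add);
        return store;
    }

    int
    main (void)
    {
        SampleInst *tree = build_sample_tree ();
        return tree->opcode == OP_STIND_I4 ? 0 : 1;
    }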
There are a number of operations supported on this representation:

    * Branch optimizations.

    * Variable liveness.

    * Loop optimizations: the dominator trees are computed, loops are detected, and their nesting level computed.

    * Conversion of the method into static single assignment form (SSA form).

    * Dead code elimination.

    * Constant propagation.

    * Copy propagation.

    * Constant folding.

Once the above optimizations are optionally performed, a decomposition phase is used to turn some complex opcodes into internal method calls. In the initial version of the JIT engine, various operations on longs are emulated instead of being inlined. Also, the newarr invocation is turned into a call to the runtime.
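As an illustration of what emulating a long operation means on a 32-bit target, the sketch below shows the kind of helper such an operation can be lowered to. The helper name and the rewrite are written out only to show the principle described above: the complex opcode is replaced by a call into the runtime.

    /* Hypothetical sketch: on a 32-bit target, a 64-bit multiply opcode can
     * be decomposed into a call to a runtime helper instead of inline code. */
    #include <stdint.h>

    /* The helper the JIT would emit a call to (name is illustrative). */
    int64_t
    mono_llmult_sketch (int64_t a, int64_t b)
    {
        return a * b;   /* the C compiler expands this for the 32-bit target */
    }

    /* During decomposition, an IR node such as

           (long.mul x y)

       is rewritten into an ordinary call node:

           (call mono_llmult_sketch x y)

       so the back end only ever sees opcodes it knows how to emit. */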
At this point, after computing variable liveness, it is possible to use the linear scan algorithm for allocating variables to registers. The linear scan pass uses the information that was previously gathered by the loop nesting and loop structure computation to favor variables in inner loops. This process updates the basic block `nesting' field, which is later used during liveness analysis.
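For readers who have not seen it, the following is a generic, self-contained sketch of linear scan allocation, not mini's actual allocator: live intervals are visited in order of starting point, intervals that have ended free their register, and when no register is free the interval that ends furthest away is spilled.

    /* Generic linear scan register allocation sketch (illustrative, not the
     * allocator used by mini). Each live interval [start, end] gets either a
     * register index or a spill marker. */
    #include <stdio.h>

    #define NUM_REGS 3   /* e.g. the x86 callee-saved ESI, EDI and EBX */
    #define SPILLED  -1

    typedef struct {
        int start, end;   /* first and last point where the variable is live */
        int reg;          /* assigned register index, or SPILLED */
    } Interval;

    /* iv[] must be sorted by increasing start point. */
    void
    linear_scan (Interval *iv, int n)
    {
        int active[NUM_REGS];   /* interval currently holding each register, or -1 */
        int i, r;

        for (r = 0; r < NUM_REGS; r++)
            active[r] = -1;

        for (i = 0; i < n; i++) {
            int free_reg = -1;
            int victim_reg = -1;

            for (r = 0; r < NUM_REGS; r++) {
                /* Expire intervals that ended before this one starts. */
                if (active[r] != -1 && iv[active[r]].end < iv[i].start)
                    active[r] = -1;
                if (active[r] == -1)
                    free_reg = r;
                else if (victim_reg == -1 ||
                         iv[active[r]].end > iv[active[victim_reg]].end)
                    victim_reg = r;
            }

            if (free_reg != -1) {
                iv[i].reg = free_reg;
                active[free_reg] = i;
            } else if (iv[active[victim_reg]].end > iv[i].end) {
                /* Spill the active interval that ends last, steal its register. */
                iv[i].reg = iv[active[victim_reg]].reg;
                iv[active[victim_reg]].reg = SPILLED;
                active[victim_reg] = i;
            } else {
                iv[i].reg = SPILLED;
            }
        }
    }

    int
    main (void)
    {
        Interval iv[] = { {0, 9, 0}, {1, 3, 0}, {2, 8, 0}, {4, 7, 0}, {5, 6, 0} };
        int i, n = sizeof (iv) / sizeof (iv[0]);

        linear_scan (iv, n);
        for (i = 0; i < n; i++) {
            if (iv[i].reg == SPILLED)
                printf ("interval [%d,%d] -> spilled\n", iv[i].start, iv[i].end);
            else
                printf ("interval [%d,%d] -> reg %d\n", iv[i].start, iv[i].end, iv[i].reg);
        }
        return 0;
    }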
Stack space is then reserved for the local variables and any temporary variables generated during the various optimizations.
** Instruction selection

At this point, the BURS instruction selector is invoked to transform the tree-based representation into a list of instructions. This is done using a tree pattern matcher that is generated for the architecture using the `monoburg' tool.

Monoburg takes as input a file that describes tree patterns, which are matched against the trees that were produced by the engine in the previous stages.

The pattern matching might have more than one match for a particular tree. In this case, the match selected is the one whose cost is the smallest. A cost can be attached to each rule, and if no cost is provided, the implicit cost is one. Smaller costs are selected over higher costs.

The cost function can be used to select particular blocks of code for a given architecture, or to avoid having a rule match at all by giving it a prohibitively high cost.

The various rules that our JIT engine uses transform a tree of MonoInsts into a list of MonoInsts:

    Tree of MonoInst  ===>  Instruction selection  ===>  List of MonoInst
During this process, various kinds of MonoInst disappear and are turned into lower-level representations. The JIT compiler just happens to reuse the same structure (this is done to reduce memory usage and improve memory locality).

The instruction selection rules are split across a number of files, each one with a particular purpose:

    inssel.brg
        Contains the generic instruction selection patterns.

    inssel-x86.brg
        Contains x86-specific rules.

    inssel-ppc.brg
        Contains PowerPC-specific rules.

    inssel-long32.brg
        BURG file for 64-bit instructions on 32-bit architectures.

    inssel-long.brg
        BURG file for 64-bit architectures.

    inssel-float.brg
        BURG file for floating point instructions.

For a given build, a set of those files is included. For example, the build of Mono on x86 uses the following set:

    inssel.brg inssel-x86.brg inssel-long32.brg inssel-float.brg
** Native method generation

The native method generation has a number of steps:

    * Architecture-specific register allocation.

      The information about loop nesting that was previously gathered is used here to hint the register allocator.

    * Generating the method prolog/epilog.

    * Optionally generating code to introduce tracing facilities.

    * Hooking into the debugger.

    * Performing any pending fixups.

    * Code generation.
*** Code Generation

The actual code generation is contained in the architecture-specific portion of the compiler. The input to the code generator is each one of the basic blocks with its list of instructions that were produced in the instruction selection phase.

During the instruction selection phase, virtual registers are assigned. Just before the peephole optimization is performed, physical registers are assigned.

A simple peephole and algebraic optimizer is run at this stage.

The peephole optimizer removes some redundant operations at this point. This is possible because the code generator now has visibility into the whole basic block that spans the original trees.

The algebraic optimizer performs some simple algebraic optimizations that replace expensive operations with cheaper operations where possible.
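As an illustration of the kind of rewrite the algebraic optimizer performs (this is a sketch, not the mini implementation), the pass below walks a toy instruction list and replaces a multiplication by a power of two with a cheaper shift:

    /* Illustrative strength-reduction pass over a toy instruction list:
     * replace "mul reg, 2^k" with "shl reg, k". Not the actual mini optimizer. */
    #include <stddef.h>

    enum { OP_MUL_IMM, OP_SHL_IMM };

    typedef struct ToyInst {
        int opcode;
        int dreg;              /* destination register */
        long imm;              /* immediate operand */
        struct ToyInst *next;  /* instructions form a singly linked list */
    } ToyInst;

    /* Returns k if imm == 2^k (k >= 1), otherwise -1. */
    int
    power_of_two_shift (long imm)
    {
        int k;
        for (k = 1; k < 31; k++)
            if (imm == (1L << k))
                return k;
        return -1;
    }

    void
    strength_reduce (ToyInst *bb_code)
    {
        ToyInst *ins;
        for (ins = bb_code; ins != NULL; ins = ins->next) {
            if (ins->opcode == OP_MUL_IMM) {
                int k = power_of_two_shift (ins->imm);
                if (k > 0) {
                    ins->opcode = OP_SHL_IMM;   /* shifts are cheaper than multiplies */
                    ins->imm = k;
                }
            }
        }
    }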
The rest of the code generation is fairly simple: a switch statement is used to generate code for each of the MonoInsts. This lives in the mono/mini/mini-ARCH.c files, in a method called "mono_arch_output_basic_block".

We always try to allocate code in sequence, instead of just using malloc. This way we increase spatial locality, which gives a massive speedup on most architectures.
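A minimal sketch of what allocating code in sequence means, assuming a single mmap'd chunk and ignoring chunk chaining, alignment and thread safety, which a real code manager has to handle:

    /* Sketch of a sequential (bump pointer) code allocator: all methods are
     * emitted back to back in one executable region, improving spatial
     * locality. Single chunk, no locking; purely illustrative. */
    #include <stddef.h>
    #include <sys/mman.h>

    static unsigned char *code_chunk;   /* start of the executable region */
    static size_t code_used;            /* bytes handed out so far */
    static size_t code_size;            /* total size of the region */

    void *
    code_alloc (size_t size)
    {
        void *p;

        if (!code_chunk) {
            code_size = 1024 * 1024;
            code_chunk = mmap (NULL, code_size, PROT_READ | PROT_WRITE | PROT_EXEC,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (code_chunk == MAP_FAILED)
                return NULL;
        }
        if (code_used + size > code_size)
            return NULL;   /* a real allocator would chain a new chunk here */

        p = code_chunk + code_used;
        code_used += size;   /* consecutive methods end up adjacent in memory */
        return p;
    }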
*** Ahead-of-Time compilation

Ahead-of-Time compilation is a new feature of our new compilation engine. The compilation engine is shared by the Just-in-Time (JIT) compiler and the Ahead-of-Time (AOT) compiler.

The difference is in the set of optimizations that are turned on for each mode: Just-in-Time compilation should be as fast as possible, while Ahead-of-Time compilation can take as long as required, because it does not happen at a time-critical point.

With AOT compilation, we can afford to turn on all of the computationally expensive optimizations.

After the code generation phase is done, the code and any required fixup information is saved into a file that is readable by "as" (the native assembler available on all systems). This assembly file is then passed to the native assembler, which generates a loadable module.
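In practice, precompiling an assembly is a single command; the --aot flag is the one exposed by the mono command, while the name and format of the generated module depend on the platform and Mono version:

    mono --aot program.exe      # emits and assembles the precompiled image
    mono program.exe            # the runtime uses the precompiled image if present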
At execution time, when an assembly is loaded from the disk, the runtime engine will probe for the existence of a pre-compiled image. If the pre-compiled image exists, then it is loaded, and the method invocations are resolved to the code contained in the loaded module.

The code generated under the AOT scenario is slightly different from the code generated in the JIT scenario: it is application-domain relative and can be shared among multiple threads.

This is the same code generation that is used when the runtime is instructed to maximize code sharing in a multi-application-domain scenario.
* SSA-based optimizations

SSA form simplifies many optimizations because each variable has exactly one definition site. This means that each variable is only initialized once.

For example, code like this:

    a = 1
    ..
    a = 2
    call (a)

is internally turned into:

    a1 = 1
    ..
    a2 = 2
    call (a2)

In the presence of branches, like:

    if (x)
        a = 1
    else
        a = 2
    call (a)

the code is turned into:

    if (x)
        a1 = 1;
    else
        a2 = 2;
    a3 = phi (a1, a2)
    call (a3)

All uses of a variable are "dominated" by its definition.

This representation is useful as it simplifies the implementation of a number of optimizations like conditional constant propagation, array bounds check removal and dead code elimination.
* Register allocation

Global register allocation is performed on the medium-level intermediate representation just before instruction selection is performed on the method. Local register allocation is later performed at the basic-block level, on the lower-level representation that instruction selection produces.

Global register allocation uses the following input:

    1) the set of register-sized variables that can be allocated to a register (this is an architecture-specific setting; for x86 these are the callee-saved registers ESI, EDI and EBX).

    2) liveness information for the variables.

    3) (optionally) loop information, to favor variables that are used in inner loops.

During the instruction selection phase, symbolic registers are assigned to temporary values in expressions.

Local register allocation assigns hard registers to the symbolic registers. It is performed just before the code is actually emitted, and works one basic block at a time.

A CPU description file describes the input registers, output registers, fixed registers and clobbered registers for each operation.
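For instance, such an entry uses the same format as the `checkthis' entry shown later in this document; a hypothetical entry for an integer addition might look like this (the exact opcode name, length and clobber information vary per architecture and Mono version):

    int_add: dest:i src1:i src2:i len:2 clob:1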
* BURG Code Generator Generator

monoburg was written by Dietmar Maurer. It is based on the papers by Christopher W. Fraser, Robert R. Henry and Todd A. Proebsting: "BURG - Fast Optimal Instruction Selection and Tree Parsing" and "Engineering a Simple, Efficient Code Generator Generator".

The original BURG implementation is unable to work on DAGs; only trees are allowed. Our monoburg implementation is able to generate a tree matcher that works on DAGs, and we use this feature in the new JIT. This simplifies the code because we can pass DAGs directly and don't need to convert them to trees.
* Adding IL opcodes: an exercise (from a post by Paolo Molaro)

mini.c is the file that reads the IL code stream and decides how each IL instruction is implemented (the mono_method_to_ir () function), so you always have to add an entry to the big switch inside that function: there are plenty of examples in that file.

An IL opcode can be implemented in a number of ways, depending on what it does and how it needs to do it.

Some opcodes are implemented using a helper function: one of the simpler examples is the CEE_STELEM_REF implementation. In this case the opcode implementation is written in a C function. You will need to register the function with the JIT before you can use it (mono_register_jit_call) and you need to emit the call to the helper using the mono_emit_jit_icall() function.

This is the simplest way to add a new opcode and it doesn't require any arch-specific change (though it's limited to what you can do in C code, and the performance may be limited by the function call).
Other opcodes can be implemented with one or more of the already implemented low-level instructions.

An example is the OP_STRLEN opcode, which implements String.Length using a simple load from memory. In this case you need to add a rule to the appropriate burg file, describing the arguments of the opcode and, if any, its 'return' value.

The OP_STRLEN case is:

    reg: OP_STRLEN (reg) {
        MONO_EMIT_LOAD_MEMBASE_OP (s, tree, OP_LOADI4_MEMBASE, state->reg1,
                                   state->left->reg1, G_STRUCT_OFFSET (MonoString, length));
    }
The above means: OP_STRLEN takes a register as an argument and returns its value in a register. The implementation is included in the braces.

The opcode returns a value in an integer register (state->reg1) by performing an int32 load of the length field of the MonoString represented by the input register (state->left->reg1). Before the burg rules are applied, the internal representation is based on trees, so you get the left/right pointers (state->left and state->right respectively; the result is stored in state->reg1).

This instruction implementation doesn't require arch-specific changes (it uses the MONO_EMIT_LOAD_MEMBASE_OP macro, which is available on all platforms), and usually the produced code is fast.
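Conceptually, the code this rule emits is just a field load at a fixed offset. The sketch below shows the idea with a simplified string layout; the struct is illustrative and not the exact MonoString definition from the runtime headers.

    /* Illustrative only: String.Length as a plain memory load. This
     * simplified struct just shows why a single OP_LOADI4_MEMBASE
     * instruction is enough. */
    #include <stdint.h>

    typedef struct {
        void    *vtable;           /* object header (simplified) */
        void    *synchronisation;
        int32_t  length;           /* the field OP_STRLEN loads */
        uint16_t chars[1];         /* UTF-16 character data follows */
    } SketchString;

    int32_t
    sketch_strlen (SketchString *str)
    {
        /* Compiles down to one 32-bit load at a fixed offset from str. */
        return str->length;
    }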
Next we have opcodes that must be implemented with new low-level architecture-specific instructions (either because of performance considerations or because the functionality can't be implemented in other ways).

You need a burg rule in this case, too. For example, consider the OP_CHECK_THIS opcode (used to raise an exception if the this pointer is null). The burg rule simply reads:

    stmt: OP_CHECK_THIS (reg) {
        mono_bblock_add_inst (s->cbb, tree);
    }

Note that this opcode does not return a value (hence the "stmt") and it takes a register as input.

mono_bblock_add_inst (s->cbb, tree) just adds the instruction (the tree variable) to the current basic block (s->cbb). In mini this is the place where the internal representation switches from the tree format to the low-level format (the list of simple instructions).
In this case the actual opcode implementation is delegated to the arch-specific code. A low-level opcode needs an entry in the machine description (the *.md files in mini/). This entry describes what kind of registers, if any, are used by the instruction, as well as other details such as constraints or other hints to the low-level engine, which are architecture specific.

cpu-pentium.md, for example, has the following entry:

    checkthis: src1:b len:3

This means the instruction uses an integer register as a base pointer (basically a load or store is done on it) and it takes 3 bytes of native code to implement.

Now you just need to provide the low-level implementation for the opcode in one of the mini-$arch.c files, in the mono_arch_output_basic_block() function. There is a big switch here, too. The x86 implementation is:

    case OP_CHECK_THIS:
        /* ensure ins->sreg1 is not NULL */
        x86_alu_membase_imm (code, X86_CMP, ins->sreg1, 0, 0);
        break;

If the $arch-codegen.h header file doesn't have the code to emit the low-level native code, you'll need to write that as well.
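Those headers are essentially byte emitters: each macro writes the instruction encoding into the code buffer and advances the pointer. The following sketch shows the general shape with a trivial one-byte instruction; the real x86 macros additionally handle ModR/M bytes, displacements and immediates.

    /* Illustrative shape of a codegen macro: write the encoding into the
     * buffer and advance the "inst" pointer. Usage: my_arch_ret (code); */
    #define my_arch_ret(inst)                       \
        do {                                        \
            *(inst)++ = 0xc3;  /* x86 "ret" */      \
        } while (0)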
Complex opcodes with register constraints may require other changes to the local register allocator, but usually they are not needed.
* Future

Profile-based optimization is something that we are very interested in supporting. There are two possible usage scenarios:

    * Based on the profile information gathered during the execution of a program, hot methods can be compiled with the highest level of optimizations, while bootstrap code and cold methods can be compiled with the least set of optimizations and placed in a discardable list.

    * Code reordering: this profile-based optimization would only make sense for pre-compiled code. The profile information is used to re-order the assembly code on disk, so that the code is placed on the disk in a way that improves locality.

      This is the same principle on which SGI's cord program works.

The nature of the CIL allows the above optimizations to be easy to implement and deploy. Since we define the whole execution environment ourselves, no interactions with system tools are required, nor are upgrades to the underlying infrastructure.

Instruction scheduling is important for certain kinds of processors, and some of the framework to cope with this exists today in our register allocator and the instruction selector, but it has not been finished. The instruction selection would happen at the same time as local register allocation.