  1. The Internals of the Mono C# Compiler
  2. Miguel de Icaza
  3. ([email protected])
  4. 2002
  5. * Abstract
  6. The Mono C# compiler is a C# compiler written in C# itself.
  7. Its goal is to provide a free, alternative implementation
  8. of the C# language. The Mono C# compiler generates ECMA CIL
  9. images through the use of the System.Reflection.Emit API, which
  10. enables the compiler to be platform independent.
  11. * Overview: How the compiler fits together
  12. The compilation process is managed by the compiler driver (it
  13. lives in driver.cs).
  14. The compiler reads a set of C# source code files, and parses
  15. them. Any assemblies or modules that the user might want to
  16. use with the project are loaded after parsing is done.
  17. Once all the files have been parsed, the type hierarchy is
  18. resolved. First interfaces are resolved, then types and
  19. enumerations.
  20. Once the type hierarchy is resolved, every type is populated:
  21. fields, methods, indexers, properties, events and delegates
  22. are entered into the type system.
  23. At this point the program skeleton has been completed. The
  24. next process is to actually emit the code for each of the
  25. executable methods. The compiler drives this from
  26. RootContext.EmitCode.
  27. Each type then has to populate its methods: populating a
  28. method requires creating a structure that is used as the state
  29. of the block being emitted (this is the EmitContext class) and
  30. then generating code for the topmost statement (the Block).
  31. Code generation has two steps: the first step is the semantic
  32. analysis (Resolve method) that resolves any pending tasks, and
  33. guarantees that the code is correct. The second step is the
  34. actual code emission. All errors are flagged during the
  35. "Resolution" process.
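The two steps can be pictured with a toy expression tree. This is only a sketch under stated assumptions: the `Node', `IntLiteral', `Add', `Report' and `TwoPhase' names and the string-based "IL" are invented for illustration and are not the compiler's real classes or signatures. What it does mirror from the text is the ordering: errors are reported only while resolving, and emission runs only on a tree that resolved cleanly.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical error sink: errors are only ever recorded during Resolve.
public class Report {
    public readonly List<string> Errors = new List<string> ();
    public void Error (string message) { Errors.Add (message); }
}

public abstract class Node {
    // Step 1: semantic analysis. Returns the resolved node, or null on error.
    public abstract Node Resolve (Report report);
    // Step 2: code emission. Only called on trees that resolved successfully.
    public abstract void Emit (List<string> il);
}

public class IntLiteral : Node {
    readonly string text;
    int value;
    public IntLiteral (string text) { this.text = text; }

    public override Node Resolve (Report report)
    {
        if (!int.TryParse (text, out value)) {
            report.Error ("invalid integer literal `" + text + "'");
            return null;
        }
        return this;
    }

    public override void Emit (List<string> il) { il.Add ("ldc.i4 " + value); }
}

public class Add : Node {
    Node left, right;
    public Add (Node left, Node right) { this.left = left; this.right = right; }

    public override Node Resolve (Report report)
    {
        // Resolve both operands so that every error is reported.
        left = left.Resolve (report);
        right = right.Resolve (report);
        return (left == null || right == null) ? null : this;
    }

    public override void Emit (List<string> il)
    {
        left.Emit (il);
        right.Emit (il);
        il.Add ("add");
    }
}

public static class TwoPhase {
    // Returns the emitted instructions, or null if Resolve flagged an error.
    public static List<string> Compile (Node tree, Report report)
    {
        Node resolved = tree.Resolve (report);
        if (resolved == null)
            return null;
        List<string> il = new List<string> ();
        resolved.Emit (il);
        return il;
    }
}
```

Calling `TwoPhase.Compile' on `new Add (new IntLiteral ("1"), new IntLiteral ("x"))' returns null and leaves one message in the report, without any emission having taken place.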
  36. After all code has been emitted, the compiler closes all
  37. the types (this basically tells the Reflection.Emit library to
  38. finish up the types). Resources and the definition of the entry
  39. point are handled at this point, and the output is saved to
  40. disk.
  41. * The parsing process
  42. All the input files that make up a program need to be read in
  43. advance, because C# allows declarations to happen after an
  44. entity is used. For example, the following is a valid program:
  45. class X : Y {
  46.         static void Main ()
  47.         {
  48.                 a = "hello"; b = "world";
  49.         }
  50.         static string a;
  51. }
  52. class Y {
  53.         public static string b;
  54. }
  55. At the time the assignment expression `a = "hello"' is parsed,
  56. it is not known whether a is a class field from this class, or
  57. its parents, or whether it is a property access or a variable
  58. reference. The actual meaning of `a' will not be discovered
  59. until the semantic analysis phase.
  60. ** The Tokenizer and the pre-processor
  61. The tokenizer is contained in the file `cs-tokenizer.cs', and
  62. the main entry point is the `token ()' method. The tokenizer
  63. implements the `yyParser.yyInput' interface, which is what the
  64. Yacc/Jay parser will use when fetching tokens.
  65. Token definitions are generated by jay during the compilation
  66. process, and those can be referenced from the tokenizer class
  67. with the `Token.' prefix.
  68. Each time a token is returned, the location for the token is
  69. recorded into the `Location' property, which can be accessed by
  70. the parser. The parser retrieves the Location properties as
  71. it builds its internal representation to allow the semantic
  72. analysis phase to produce error messages that can pinpoint
  73. the location of the problem.
  74. Some tokens have values associated with them. For example, when
  75. the tokenizer encounters a string, it will return a
  76. LITERAL_STRING token, and the actual string parsed will be
  77. available in the `Value' property of the tokenizer. The same
  78. mechanism is used to return integers and floating point
  79. numbers.
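The contract between tokenizer and parser described above can be sketched as follows. Only `token ()', `Value', `Location' and LITERAL_STRING come from the text. The `ITokenSource' interface is a hypothetical stand-in for the jay-generated `yyParser.yyInput' interface (whose real members are not shown in this document), the numeric token values are made up, and `MiniTokenizer' handles just enough input to show the mechanism. The location here is simply a line number.

```csharp
using System;

// Stand-in for the jay-generated token constants (the values are made up).
public static class Token {
    public const int EOF = 0;
    public const int IDENTIFIER = 1;
    public const int LITERAL_INTEGER = 2;
    public const int LITERAL_STRING = 3;
}

// Hypothetical stand-in for the interface the parser consumes.
public interface ITokenSource {
    int token ();
    object Value { get; }
    int Location { get; }
}

public class MiniTokenizer : ITokenSource {
    readonly string input;
    int pos, line = 1;
    object val;
    int location;

    public MiniTokenizer (string input) { this.input = input; }

    public object Value { get { return val; } }
    public int Location { get { return location; } }

    public int token ()
    {
        // Skip whitespace, counting lines as we go.
        while (pos < input.Length && char.IsWhiteSpace (input [pos])) {
            if (input [pos] == '\n')
                line++;
            pos++;
        }
        // Record where this token starts before returning it.
        location = line;
        val = null;
        if (pos >= input.Length)
            return Token.EOF;

        int start = pos;
        char c = input [pos];
        if (c == '"') {
            // String literal; escape sequences are not handled in this sketch.
            start = ++pos;
            while (pos < input.Length && input [pos] != '"')
                pos++;
            val = input.Substring (start, pos - start);
            if (pos < input.Length)
                pos++; // consume the closing quote
            return Token.LITERAL_STRING;
        }
        if (char.IsDigit (c)) {
            while (pos < input.Length && char.IsDigit (input [pos]))
                pos++;
            val = int.Parse (input.Substring (start, pos - start));
            return Token.LITERAL_INTEGER;
        }
        // Anything else, up to whitespace or a quote, is treated as an identifier.
        while (pos < input.Length && !char.IsWhiteSpace (input [pos]) && input [pos] != '"')
            pos++;
        val = input.Substring (start, pos - start);
        return Token.IDENTIFIER;
    }
}
```

A parser would call `token ()' in a loop and read `Value' and `Location' right after each call, since both are overwritten by the next token.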
  80. C# has a limited pre-processor that allows conditional
  81. compilation, but it is not as fully featured as the C
  82. pre-processor, and most notably, macros are missing. This
  83. makes it simple to implement in very few lines and mesh it
  84. with the tokenizer.
  85. The `handle_preprocessing_directive' method in the tokenizer
  86. handles all the pre-processing, and it is invoked when the '#'
  87. symbol is found as the first token in a line.
  88. The state of the pre-processor is contained in a Stack called
  89. `ifstack'. This state is used to track the if/elif/else/endif
  90. nesting and the current state. The state is encoded in the
  91. top of the stack as a number of values: `TAKING',
  92. `TAKEN_BEFORE', `ELSE_SEEN', `PARENT_TAKING'.
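One way to picture the `ifstack' bookkeeping is the sketch below. The four state names are taken from the text. Everything else is an assumption made for illustration: the states are modelled as bit flags combined in one int per nesting level, and the `IfStack' class with its `If', `Elif', `Else', `Endif' and `Taking' members is invented here, not the tokenizer's actual code.

```csharp
using System;
using System.Collections.Generic;

public class IfStack {
    // Assumption: the four states are modelled as bit flags.
    const int TAKING = 1;        // lines in the current branch are compiled
    const int TAKEN_BEFORE = 2;  // an earlier branch of this #if was compiled
    const int ELSE_SEEN = 4;     // #else has already appeared at this level
    const int PARENT_TAKING = 8; // the enclosing region is being compiled

    readonly Stack<int> ifstack = new Stack<int> ();

    // True when source lines at the current position should be compiled.
    public bool Taking {
        get { return ifstack.Count == 0 || (ifstack.Peek () & TAKING) != 0; }
    }

    public void If (bool condition)
    {
        int state = Taking ? PARENT_TAKING : 0;
        if (state != 0 && condition)
            state |= TAKING;
        ifstack.Push (state);
    }

    public void Elif (bool condition)
    {
        int state = PopOpen ("#elif");
        ifstack.Push (NextBranch (state, condition));
    }

    public void Else ()
    {
        int state = PopOpen ("#else");
        ifstack.Push (NextBranch (state, true) | ELSE_SEEN);
    }

    public void Endif ()
    {
        if (ifstack.Count == 0)
            throw new InvalidOperationException ("#endif without #if");
        ifstack.Pop ();
    }

    // Pops the current level, rejecting directives that cannot appear here.
    int PopOpen (string directive)
    {
        if (ifstack.Count == 0 || (ifstack.Peek () & ELSE_SEEN) != 0)
            throw new InvalidOperationException ("unexpected " + directive);
        return ifstack.Pop ();
    }

    // Computes the state for the branch that follows the current one.
    static int NextBranch (int state, bool condition)
    {
        if ((state & TAKING) != 0)
            return (state & ~TAKING) | TAKEN_BEFORE;
        bool eligible = (state & TAKEN_BEFORE) == 0 && (state & PARENT_TAKING) != 0;
        return eligible && condition ? state | TAKING : state;
    }
}
```

Keeping PARENT_TAKING in each entry means a nested `#if' inside a skipped region can never start compiling lines, whatever its own condition evaluates to.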
  93. ** Locations
  94. Locations are encoded as a 32-bit number (the Location
  95. struct) that maps each input source line to a linear number.
  96. As new files are parsed, the Location manager is informed of
  97. the new file, to allow it to map back from an int constant to
  98. a file + line number.
  99. The tokenizer also tracks the column number for a token, but
  100. this is currently not being used or encoded. It could
  101. probably be encoded in the low 9 bits, allowing for columns
  102. from 1 to 512 to be encoded.
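The linear numbering can be illustrated with a small table that remembers, for each file, the linear number of its first line. This is a sketch, not the Location struct itself: the `LocationTable' class and its `Push', `Encode' and `Decode' members are hypothetical, and the column helpers only show what the 9-bit idea mentioned above might look like if it were implemented.

```csharp
using System;
using System.Collections.Generic;

public class LocationTable {
    readonly List<string> files = new List<string> ();
    readonly List<int> firstLine = new List<int> (); // linear number of line 1 of each file
    int next = 1;                                    // next unused linear number

    // Called when the tokenizer starts reading a new file.
    public void Push (string file)
    {
        files.Add (file);
        firstLine.Add (next);
    }

    // Maps a line of the most recently pushed file to a linear number.
    public int Encode (int line)
    {
        if (files.Count == 0 || line < 1)
            throw new InvalidOperationException ("no current file or bad line");
        int linear = firstLine [firstLine.Count - 1] + line - 1;
        if (linear >= next)
            next = linear + 1;
        return linear;
    }

    // Maps a linear number back to a file + line pair.
    public void Decode (int linear, out string file, out int line)
    {
        int i = files.Count - 1;
        while (i > 0 && firstLine [i] > linear)
            i--;
        file = files [i];
        line = linear - firstLine [i] + 1;
    }

    // Hypothetical column packing: columns 1..512 stored in the low 9 bits.
    public static int WithColumn (int linear, int column) { return (linear << 9) | (column - 1); }
    public static int ColumnOf (int packed) { return (packed & 511) + 1; }
    public static int LineOf (int packed) { return packed >> 9; }
}
```

Packing the column this way leaves 23 bits of a 32-bit value for the linear line number, which is one trade-off to weigh before adopting it.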
  103. * The Parser
  104. The parser is written using Jay, which is a port of Berkeley
  105. Yacc to Java, and which I later ported to C#.
  106. Many people ask why the grammar of the parser does not match
  107. exactly the definition in the C# specification. The reason is
  108. simple: the grammar in the C# specification is designed to be
  109. consumed by humans, and not by a computer program. Before
  110. you can feed this grammar to a tool, it needs to be simplified
  111. to allow the tool to generate a correct parser for it.
  112. In the Mono C# compiler, we use a class for each of the
  113. statements and expressions in the C# language. For example,
  114. there is a `While' class for the `while' statement, a
  115. `Cast' class to represent a cast expression and so on.
  116. There is a Statement class, and an Expression class which are
  117. the base classes for statements and expressions.
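As a rough picture of this one-class-per-construct design, the sketch below builds the tree for `while ((bool) flag) ;'. Only the `While', `Cast', `Statement' and `Expression' names come from the text. The fields, the constructors, the `SimpleName' and `EmptyStatement' helpers and the `TreeDemo' class are assumptions made for illustration, and the real classes carry far more state.

```csharp
// Base classes; the Location field is an illustrative guess.
public abstract class Expression {
    public int Location;
}

public abstract class Statement {
    public int Location;
}

// One class per language construct.
public class SimpleName : Expression {
    public readonly string Name;
    public SimpleName (string name) { Name = name; }
}

public class Cast : Expression {
    public readonly string TargetType;
    public readonly Expression Expr;
    public Cast (string targetType, Expression expr)
    {
        TargetType = targetType;
        Expr = expr;
    }
}

public class EmptyStatement : Statement {
}

public class While : Statement {
    public readonly Expression Condition;
    public readonly Statement Body;
    public While (Expression condition, Statement body, int location)
    {
        Condition = condition;
        Body = body;
        Location = location;
    }
}

public static class TreeDemo {
    // Roughly what a grammar action for `while ((bool) flag) ;' could build.
    public static While Build ()
    {
        return new While (new Cast ("bool", new SimpleName ("flag")),
                          new EmptyStatement (), 1);
    }
}
```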
  118. ** Namespaces
  119. Using list.
  120. * Internal Representation
  121. ** Expressions
  122. *** The Expression Class
  123. The utility functions that can be called by all children of
  124. Expression.
  125. ** Constants
  126. Constants in the Mono C# compiler are represented by the
  127. abstract class `Constant'. Constant is in turn derived from
  128. Expression. The base constructor for `Constant' just sets the
  129. expression class to be an `ExprClass.Value'. Constants are
  130. born in a fully resolved state, so the `DoResolve' method
  131. only returns a reference to itself.
  132. Each Constant should implement the `GetValue' method, which
  133. returns an object with the actual contents of this constant. A
  134. utility virtual method called `AsString' is used to render a
  135. diagnostic message. The output of AsString is shown to the
  136. developer when an error or a warning is triggered.
  137. Constant classes also participate in the constant folding
  138. process. Expressions that can be constant folded trigger
  139. folding by invoking the functionality provided by the
  140. ConstantFold class (cfold.cs).
  141. Each Constant has to implement a number of methods to convert
  142. itself into a Constant of a different type. These methods are
  143. called `ConvertToXXXX' and they are invoked by the wrapper
  144. functions `ToXXXX'. These methods only perform implicit
  145. numeric conversions. Explicit conversions are handled by the
  146. `Cast' expression class.
  147. The `ToXXXX' methods are the entry point, and provide error
  148. reporting in case a conversion cannot be performed.
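The relationship between `GetValue', `AsString', the `ConvertToXXXX' / `ToXXXX' pairs and constant folding can be sketched as follows. The method naming pattern and the `Constant' and `ConstantFold' class names follow the text. The concrete `IntConstant' and `DoubleConstant' classes, every signature, the error list and the single `Add' fold are assumptions chosen to keep the example small. The real classes derive from Expression and also honour the checked state, which this sketch ignores.

```csharp
using System;
using System.Collections.Generic;

public abstract class Constant {
    // The actual contents of the constant, boxed.
    public abstract object GetValue ();

    // Text used in diagnostics.
    public virtual string AsString () { return GetValue ().ToString (); }

    // Implicit numeric conversions; null means "no implicit conversion".
    public virtual IntConstant ConvertToInt () { return null; }
    public virtual DoubleConstant ConvertToDouble () { return null; }

    // Entry points: the same conversions plus error reporting.
    public IntConstant ToInt (List<string> errors)
    {
        IntConstant c = ConvertToInt ();
        if (c == null)
            errors.Add ("cannot implicitly convert `" + AsString () + "' to int");
        return c;
    }

    public DoubleConstant ToDouble (List<string> errors)
    {
        DoubleConstant c = ConvertToDouble ();
        if (c == null)
            errors.Add ("cannot implicitly convert `" + AsString () + "' to double");
        return c;
    }
}

public class IntConstant : Constant {
    public readonly int Value;
    public IntConstant (int value) { Value = value; }
    public override object GetValue () { return Value; }
    public override IntConstant ConvertToInt () { return this; }
    public override DoubleConstant ConvertToDouble () { return new DoubleConstant (Value); }
}

public class DoubleConstant : Constant {
    public readonly double Value;
    public DoubleConstant (double value) { Value = value; }
    public override object GetValue () { return Value; }
    public override DoubleConstant ConvertToDouble () { return this; }
    // double -> int is an explicit conversion, so ConvertToInt stays null.
}

public static class ConstantFold {
    // Folds `left + right' at compile time; returns null when it cannot.
    public static Constant Add (Constant left, Constant right)
    {
        IntConstant li = left as IntConstant, ri = right as IntConstant;
        if (li != null && ri != null)
            return new IntConstant (unchecked (li.Value + ri.Value));

        DoubleConstant ld = left.ConvertToDouble (), rd = right.ConvertToDouble ();
        if (ld != null && rd != null)
            return new DoubleConstant (ld.Value + rd.Value);
        return null;
    }
}
```

Folding an int with a double goes through `ConvertToDouble' on both operands, which is the implicit conversion the language would apply, while asking a `DoubleConstant' for `ToInt' records an error because that conversion needs an explicit cast.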
  149. ** Statements
  150. * The semantic analysis
  151. As noted above, the compiler driver has to parse all the input files.
  152. Once all the input files have been parsed, and an internal
  153. representation of the input program exists, the following
  154. steps are taken:
  155. * The interface hierarchy is resolved first.
  156. As the interface hierarchy is constructed,
  157. TypeBuilder objects are created for each one of
  158. them.
  159. * The class and structure hierarchy is resolved next;
  160. TypeBuilder objects are created for them.
  161. * Constants and enumerations are resolved.
  162. * Method, indexer, property, delegate and event
  163. definitions are now entered into the TypeBuilders.
  164. * Elements that contain code are now invoked to
  165. perform semantic analysis and code generation.
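The order of these steps maps directly onto System.Reflection.Emit calls. The sketch below defines one class, enters one method into its TypeBuilder, emits the body and then closes the type. It makes several assumptions: it uses the static `AssemblyBuilder.DefineDynamicAssembly' entry point of current .NET (a compiler from 2002 would have gone through the AppDomain instead), it builds an in-memory assembly rather than saving one to disk, and the `Greeter' and `Answer' names are invented.

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

public static class SkeletonThenCode {
    public static Type Build ()
    {
        // Assumption: modern .NET entry point, in-memory assembly only.
        AssemblyBuilder assembly = AssemblyBuilder.DefineDynamicAssembly (
            new AssemblyName ("Demo"), AssemblyBuilderAccess.Run);
        ModuleBuilder module = assembly.DefineDynamicModule ("Demo");

        // Hierarchy resolution: a TypeBuilder is created for the class.
        TypeBuilder type = module.DefineType (
            "Greeter", TypeAttributes.Public | TypeAttributes.Class);

        // Population: the method is entered into the TypeBuilder, no body yet.
        MethodBuilder method = type.DefineMethod (
            "Answer",
            MethodAttributes.Public | MethodAttributes.Static,
            typeof (int), Type.EmptyTypes);

        // Code generation: the body is emitted once the skeleton exists.
        ILGenerator ig = method.GetILGenerator ();
        ig.Emit (OpCodes.Ldc_I4, 42);
        ig.Emit (OpCodes.Ret);

        // Closing the type tells Reflection.Emit to finish it up.
        return type.CreateType ();
    }
}
```

Because the method signature is registered before any body is emitted, code in one method can refer to members whose bodies have not been generated yet, which is what the skeleton-first ordering buys.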
  166. * Output Generation
  167. ** Code Generation
  168. The EmitContext class is created any time that IL code is to
  169. be generated (methods, properties, indexers and attributes all
  170. create EmitContexts).
  171. The EmitContext keeps track of the current namespace and type
  172. container. This is used during name resolution.
  173. An EmitContext is used by the underlying code generation
  174. facilities to track the state of code generation:
  175. * The ILGenerator used to generate code for this
  176. method.
  177. * The TypeContainer where the code lives. This is used
  178. to access the TypeBuilder.
  179. * The DeclSpace. This is used to resolve names through
  180. RootContext.LookupType in the various statements and
  181. expressions.
  182. Code generation state is also tracked here:
  183. * CheckState:
  184. This variable tracks the `checked' state of the
  185. compilation. It controls whether we should generate
  186. code that does overflow checking, or if we generate
  187. code that ignores overflows.
  188. The default setting comes from the command line
  189. option to generate checked or unchecked code plus
  190. any source code changes using the checked/unchecked
  191. statements or expressions. Contrast this with the
  192. ConstantCheckState flag.
  193. * ConstantCheckState
  194. The constant check state is always set to `true' and
  195. cannot be changed from the command line. The source
  196. code can change this setting with the `checked' and
  197. `unchecked' statements and expressions.
  198. * IsStatic
  199. Whether we are emitting code inside a static or
  200. instance method
  201. * ReturnType
  202. The type of the value that is allowed to be returned, or null if
  203. there is no return type.
  204. * ContainerType
  205. Points to the Type (extracted from the
  206. TypeContainer) that declares this body of
  207. code.
  208. * IsConstructor
  209. Whether this is generating code for a constructor
  210. * CurrentBlock
  211. Tracks the current block being generated.
  212. * ReturnLabel
  213. The location where return has to jump to return the
  214. value
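The difference between CheckState and ConstantCheckState reflects a language rule that is easy to see from user code: non-constant arithmetic follows the current checked or unchecked context, while constant expressions are overflow-checked at compile time unless the source opts out. The `CheckStateDemo' class and its members below are illustrative names only.

```csharp
using System;

public static class CheckStateDemo {
    // Non-constant expression in an unchecked context: overflow wraps around.
    public static int AddUnchecked (int a, int b)
    {
        return unchecked (a + b);
    }

    // Non-constant expression in a checked context: overflow throws.
    public static bool TryAddChecked (int a, int b, out int sum)
    {
        try {
            sum = checked (a + b);
            return true;
        } catch (OverflowException) {
            sum = 0;
            return false;
        }
    }

    // Constant expression: checked by default, so the source must opt out.
    // Without `unchecked' the compiler rejects this line with error CS0220.
    public const int Wrapped = unchecked (int.MaxValue + 1);
}
```

Here `AddUnchecked (int.MaxValue, 1)' and `Wrapped' both yield int.MinValue, while `TryAddChecked (int.MaxValue, 1, out sum)' returns false.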
  215. A few variables are used to track state that is checked in
  216. for loops, or in try/catch statements:
  217. * InFinally
  218. Whether we are in a Finally block
  219. * InTry
  220. Whether we are in a Try block
  221. * InCatch
  222. Whether we are in a Catch block
  223. * InUnsafe
  224. Whether we are inside an unsafe block
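A sketch of how such flags might be used: the flag is set while the statements of a finally block are processed and restored afterwards, so that a nested statement can consult it. The C# rule used as the example, that control cannot leave a finally clause with `return' (error CS0157), is part of the language. The trimmed-down `EmitContext' below, the `Errors' list and the `FlowChecks' helpers are assumptions made for illustration and do not reproduce the real class.

```csharp
using System;
using System.Collections.Generic;

// Trimmed-down, hypothetical version of the context described above.
public class EmitContext {
    public bool InTry, InCatch, InFinally, InUnsafe;
    public bool IsStatic, IsConstructor;
    public Type ReturnType; // null when there is no return type
    public readonly List<string> Errors = new List<string> ();
}

public static class FlowChecks {
    // Runs `body' with InFinally set, restoring the previous value
    // afterwards so that nested blocks compose correctly.
    public static void InsideFinally (EmitContext ec, Action body)
    {
        bool old = ec.InFinally;
        ec.InFinally = true;
        try {
            body ();
        } finally {
            ec.InFinally = old;
        }
    }

    // A `return' statement would consult the flag while being resolved.
    public static bool CheckReturn (EmitContext ec)
    {
        if (ec.InFinally) {
            ec.Errors.Add ("Control cannot leave the body of a finally clause");
            return false;
        }
        return true;
    }
}
```

Saving and restoring the old value, instead of simply clearing the flag, keeps the state correct when a finally block is nested inside another one.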