- The Internals of the Mono C# Compiler
-
- Miguel de Icaza
- ([email protected])
- 2002
- * Abstract
- The Mono C# compiler is a C# compiler written in C# itself.
- Its goal is to provide a free, alternative implementation of
- the C# language. The Mono C# compiler generates ECMA CIL
- images through the System.Reflection.Emit API, which enables
- the compiler to be platform independent.
-
- * Overview: How the compiler fits together
- The compilation process is managed by the compiler driver (it
- lives in driver.cs).
- The compiler reads a set of C# source code files, and parses
- them. Any assemblies or modules that the user might want to
- use with his project are loaded after parsing is done.
- Once all the files have been parsed, the type hierarchy is
- resolved. First interfaces are resolved, then types and
- enumerations.
- Once the type hierarchy is resolved, every type is populated:
- fields, methods, indexers, properties, events and delegates
- are entered into the type system.
- At this point the program skeleton has been completed. The
- next process is to actually emit the code for each of the
- executable methods. The compiler drives this from
- RootContext.EmitCode.
- Each type then has to populate its methods: populating a
- method requires creating a structure that is used as the state
- of the block being emitted (this is the EmitContext class) and
- then generating code for the topmost statement (the Block).
- Code generation has two steps: the first step is the semantic
- analysis (Resolve method) that resolves any pending tasks, and
- guarantees that the code is correct. The second step is the
- actual code emission. All errors are flagged during the
- "Resolution" process.
- After all code has been emitted, the compiler closes all the
- types (this basically tells the Reflection.Emit library to
- finish up the types). Resources and the definition of the
- entry point are handled at this point, and the output is
- saved to disk.
- The following list will give you an idea of where the
- different pieces of the compiler live:
- Infrastructure:
- driver.cs:
- This drives the compilation process: loading of
- command line options; parsing the input files;
- loading the referenced assemblies; resolving the type
- hierarchy and emitting the code.
- codegen.cs:
- The state tracking for code generation.
- attribute.cs:
- Code to do semantic analysis and emit the attributes
- is here.
- rootcontext.cs:
- Keeps track of the types defined in the source code,
- as well as the assemblies loaded.
- typemanager.cs:
- This contains the MCS type system.
- report.cs:
- Error and warning reporting methods.
- support.cs:
- Assorted utility functions used by the compiler.
-
- Parsing
- cs-tokenizer.cs:
- The tokenizer for the C# language; it also includes
- the C# pre-processor.
- cs-parser.jay, cs-parser.cs:
- The parser is implemented using a C# port of the Yacc
- parser. The parser lives in the cs-parser.jay file,
- and cs-parser.cs is the generated parser.
- location.cs:
- The `location' structure is a compact representation
- of a file, line, column where a token, or a high-level
- construct appears. This is used to report errors.
- Expressions:
-
- ecore.cs:
- Basic expression classes and interfaces; most shared
- code and static methods are here.
- expression.cs:
- Most of the different kinds of expressions classes
- live in this file.
- assign.cs:
- The assignment expression got its own file.
- constant.cs:
- The classes that represent the constant expressions.
- literal.cs:
- Literals are constants that have been entered manually
- in the source code, like `1' or `true'. The compiler
- needs to tell constants and literals apart during the
- compilation process, as literals sometimes have some
- implicit extra conversions defined for them.
- cfold.cs:
- The constant folder for binary expressions.
- Statements
- statement.cs:
- All of the abstract syntax tree elements for
- statements live in this file. This also drives the
- semantic analysis process.
- Declarations, Classes, Structs, Enumerations
- decl.cs:
- This contains the base class for Members and
- Declaration Spaces. A declaration space introduces
- new names in types, so classes, structs, delegates and
- enumerations derive from it.
- class.cs:
- Methods for holding and defining class and struct
- information, and every member that can be in these
- (methods, fields, delegates, events, etc).
- The most interesting type here is the `TypeContainer',
- which derives from `DeclSpace'.
- delegate.cs:
- Handles delegate definition and use.
- enum.cs:
- Handles enumerations.
- interface.cs:
- Holds and defines interfaces. All the code related to
- interface declaration lives here.
- parameter.cs:
- During the parsing process, the compiler encapsulates
- parameters in the Parameter and Parameters classes.
- These classes provide definition and resolution tools
- for them.
- pending.cs:
- Routines to track pending implementations of abstract
- methods and interfaces. These are used by the
- TypeContainer-derived classes to track whether every
- method required is implemented.
-
- * The parsing process
- All the input files that make up a program need to be read in
- advance, because C# allows declarations to happen after an
- entity is used, for example, the following is a valid program:
- class X : Y {
-     static void Main ()
-     {
-         a = "hello"; b = "world";
-     }
-     static string a;
- }
-
- class Y {
-     public static string b;
- }
- At the time the assignment expression `a = "hello"' is parsed,
- it is not known whether `a' is a field of this class or of one
- of its parents, or whether it is a property access or a
- variable reference. The actual meaning of `a' will not be
- discovered until the semantic analysis phase.
- ** The Tokenizer and the pre-processor
- The tokenizer is contained in the file `cs-tokenizer.cs', and
- the main entry point is the `token ()' method. The tokenizer
- implements the `yyParser.yyInput' interface, which is what the
- Yacc/Jay parser will use when fetching tokens.
- Token definitions are generated by jay during the compilation
- process, and they can be referenced from the tokenizer class
- with the `Token.' prefix.
- Each time a token is returned, the location for the token is
- recorded into the `Location' property, which can be accessed
- by the parser. The parser retrieves the Location property as
- it builds its internal representation, so that the semantic
- analysis phase can produce error messages that pinpoint the
- location of the problem.
- Some tokens have values associated with them. For example,
- when the tokenizer encounters a string, it will return a
- LITERAL_STRING token, and the actual string parsed will be
- available in the `Value' property of the tokenizer. The same
- mechanism is used to return integers and floating point
- numbers.
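- As a rough illustration of this contract, here is a minimal
- sketch of a tokenizer that returns a token code from `token ()'
- and publishes the payload through `Value' and the position
- through `Location'. The class name MiniTokenizer, the numeric
- token codes and the use of a plain line number as the location
- are assumptions made for the example; the real tokenizer is
- considerably more involved.

```csharp
using System;

// Hypothetical token codes; in mcs they are generated by jay and
// referenced with the `Token.' prefix.
public static class Token {
    public const int EOF = 0;
    public const int IDENTIFIER = 1;
    public const int LITERAL_INTEGER = 2;
    public const int LITERAL_STRING = 3;
}

// Simplified stand-in for the tokenizer, not the mcs implementation.
public class MiniTokenizer {
    readonly string input;
    int pos;
    int line = 1;

    public object Value { get; private set; }    // payload of the last token
    public int Location { get; private set; }    // here: just the line number

    public MiniTokenizer (string input) { this.input = input; }

    public int token ()
    {
        // Skip whitespace, counting lines as we go.
        while (pos < input.Length && char.IsWhiteSpace (input [pos])) {
            if (input [pos] == '\n')
                line++;
            pos++;
        }
        Value = null;
        Location = line;
        if (pos >= input.Length)
            return Token.EOF;

        int start = pos;
        char c = input [pos];
        if (c == '"') {
            // String literal, without escape handling.
            pos++;
            while (pos < input.Length && input [pos] != '"')
                pos++;
            Value = input.Substring (start + 1, pos - start - 1);
            if (pos < input.Length)
                pos++;    // consume the closing quote
            return Token.LITERAL_STRING;
        }
        if (char.IsDigit (c)) {
            while (pos < input.Length && char.IsDigit (input [pos]))
                pos++;
            Value = int.Parse (input.Substring (start, pos - start));
            return Token.LITERAL_INTEGER;
        }
        // Anything else is read up to the next whitespace as an identifier.
        while (pos < input.Length && !char.IsWhiteSpace (input [pos]))
            pos++;
        Value = input.Substring (start, pos - start);
        return Token.IDENTIFIER;
    }
}
```

- A parser built on this would call `token ()' in a loop and read
- `Value' and `Location' right after each call, before asking
- for the next token.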
- C# has a limited pre-processor that allows conditional
- compilation, but it is not as fully featured as the C
- pre-processor, and most notably, macros are missing. This
- makes it simple to implement in very few lines and mesh it
- with the tokenizer.
- The `handle_preprocessing_directive' method in the tokenizer
- handles all the pre-processing, and it is invoked when the '#'
- symbol is found as the first token in a line.
- The state of the pre-processor is contained in a Stack called
- `ifstack'. This state is used to track the if/elif/else/endif
- nesting and the current state. The state is encoded in the
- top of the stack using a number of values: `TAKING',
- `TAKEN_BEFORE', `ELSE_SEEN' and `PARENT_TAKING'.
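- One plausible way to read those four values is as bit flags
- that are combined on each stack entry. The sketch below is an
- illustration under that assumption: the flag names come from
- the text above, but their numeric values, their exact meaning
- and the MiniPreprocessor class are guesses made for the
- example, not a copy of the tokenizer code.

```csharp
using System;
using System.Collections.Generic;

[Flags]
public enum IfState {
    None = 0,
    TAKING = 1,          // lines in the current branch are compiled
    TAKEN_BEFORE = 2,    // some branch of this #if was already taken
    ELSE_SEEN = 4,       // #else was seen, no more #elif or #else allowed
    PARENT_TAKING = 8    // the enclosing region is being compiled
}

public class MiniPreprocessor {
    readonly Stack<IfState> ifstack = new Stack<IfState> ();

    // True when source lines at the current nesting level are compiled.
    public bool Taking {
        get { return ifstack.Count == 0 || (ifstack.Peek () & IfState.TAKING) != 0; }
    }

    public void If (bool condition)
    {
        bool parent = Taking;
        IfState s = parent ? IfState.PARENT_TAKING : IfState.None;
        if (parent && condition)
            s |= IfState.TAKING | IfState.TAKEN_BEFORE;
        ifstack.Push (s);
    }

    public void Elif (bool condition)
    {
        IfState s = Pop ("#elif");
        if ((s & IfState.ELSE_SEEN) != 0)
            throw new InvalidOperationException ("#elif after #else");
        s &= ~IfState.TAKING;
        bool parent = (s & IfState.PARENT_TAKING) != 0;
        bool before = (s & IfState.TAKEN_BEFORE) != 0;
        if (parent && !before && condition)
            s |= IfState.TAKING | IfState.TAKEN_BEFORE;
        ifstack.Push (s);
    }

    public void Else ()
    {
        IfState s = Pop ("#else");
        if ((s & IfState.ELSE_SEEN) != 0)
            throw new InvalidOperationException ("duplicate #else");
        s &= ~IfState.TAKING;
        s |= IfState.ELSE_SEEN;
        bool parent = (s & IfState.PARENT_TAKING) != 0;
        bool before = (s & IfState.TAKEN_BEFORE) != 0;
        if (parent && !before)
            s |= IfState.TAKING | IfState.TAKEN_BEFORE;
        ifstack.Push (s);
    }

    public void Endif ()
    {
        Pop ("#endif");
    }

    IfState Pop (string directive)
    {
        if (ifstack.Count == 0)
            throw new InvalidOperationException (directive + " without #if");
        return ifstack.Pop ();
    }
}
```

- Keeping PARENT_TAKING on every entry means that a nested #else
- can never switch compilation back on inside a region that an
- outer #if has already excluded.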
- ** Locations
- Locations are encoded as a 32-bit number (the Location
- struct) that maps each input source line to a linear number.
- As new files are parsed, the Location manager is informed of
- the new file, so that it can map back from an int constant to
- a file + line number.
- The tokenizer also tracks the column number for a token, but
- this is currently not being used or encoded. It could
- probably be encoded in the low 9 bits, allowing for columns
- from 1 to 512 to be encoded.
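- To make the encoding idea concrete, here is a small sketch of
- how a linear line number and a column could share one 32-bit
- value, with the column in the low 9 bits as suggested above.
- The names MiniLocation and MiniLocationManager, and the idea
- of reserving a block of linear numbers per file up front, are
- assumptions for the example rather than the actual code.

```csharp
using System;
using System.Collections.Generic;

public struct MiniLocation {
    public const int ColumnBits = 9;
    const int ColumnMask = (1 << ColumnBits) - 1;    // 511

    readonly int token;    // the whole location fits in one 32-bit int

    public MiniLocation (int linearLine, int column)
    {
        // Columns 1..512 are stored as 0..511; anything wider is clamped.
        int c = Math.Min (Math.Max (column, 1), ColumnMask + 1) - 1;
        token = (linearLine << ColumnBits) | c;
    }

    public int LinearLine { get { return token >> ColumnBits; } }
    public int Column { get { return (token & ColumnMask) + 1; } }
}

public class MiniLocationManager {
    readonly List<string> files = new List<string> ();
    readonly List<int> firstLinear = new List<int> ();
    int next;    // next unused linear line number

    // Registers a file and reserves `lineCount' linear numbers for it.
    // Returns the start of the range: linear = start + (line - 1).
    public int AddFile (string name, int lineCount)
    {
        int start = next;
        files.Add (name);
        firstLinear.Add (start);
        next += lineCount;
        return start;
    }

    // Maps a location back to a file name and a 1-based line number.
    public void Lookup (MiniLocation loc, out string file, out int line)
    {
        if (files.Count == 0)
            throw new InvalidOperationException ("no files registered");
        int i = firstLinear.Count - 1;
        while (i > 0 && firstLinear [i] > loc.LinearLine)
            i--;
        file = files [i];
        line = loc.LinearLine - firstLinear [i] + 1;
    }
}
```

- With 9 bits spent on the column, a non-negative 32-bit int
- still leaves 22 bits for the linear line number in this
- sketch.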
- * The Parser
- The parser is written using Jay, which is a port of Berkeley
- Yacc to Java that I later ported to C#.
- Many people ask why the grammar of the parser does not match
- exactly the definition in the C# specification. The reason is
- simple: the grammar in the C# specification is designed to be
- consumed by humans, and not by a computer program. Before
- you can feed this grammar to a tool, it needs to be simplified
- to allow the tool to generate a correct parser for it.
- In the Mono C# compiler, we use a class for each of the
- statements and expressions in the C# language. For example,
- there is a `While' class for the `while' statement, a
- `Cast' class to represent a cast expression and so on.
- There is a Statement class, and an Expression class which are
- the base classes for statements and expressions.
- ** Namespaces
-
- Using list.
- * Internal Representation
- ** Expressions
- Expressions in the Mono C# compiler are represented by the
- `Expression' class. This is an abstract class that particular
- kinds of expressions have to inherit from and override a few
- methods.
- The base Expression class contains two fields: `eclass' which
- represents the "expression classification" (from the C#
- specs) and the type of the expression.
- Expressions have to be resolved before they can be used.
- The resolution process is implemented by overriding the
- `DoResolve' method. The DoResolve method has to set the
- `eclass' field and the `type', perform all error checking and
- computations that will be required for code generation at this
- stage.
- The return value from DoResolve is an expression. Most of the
- time an Expression derived class will return itself (return
- this) when it will handle the emission of the code itself, or
- it can return a new Expression.
- For example, the parser will create an "ElementAccess" class
- for:
- a [0] = 1;
- During the resolution process, the compiler will know whether
- this is an array access or an indexer access, and DoResolve
- will return either an ArrayAccess expression or an
- IndexerAccess expression.
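- The replace-yourself pattern can be sketched as follows. This
- is a simplified model, not the mcs classes: DoResolve takes no
- arguments here, the target type of `a' is assumed to be known
- already, System.Reflection stands in for the compiler's own
- type lookup, and every ExprClass member other than `Value' is
- an assumption of the example.

```csharp
using System;
using System.Reflection;

// Illustrative subset of expression classifications.
public enum ExprClass { Invalid, Value, Variable, IndexerAccess }

public abstract class Expression {
    protected ExprClass eclass = ExprClass.Invalid;
    protected Type type;

    public ExprClass Eclass { get { return eclass; } }
    public Type ExprType { get { return type; } }

    // Returns the expression to use from now on, or null on error.
    public abstract Expression DoResolve ();
}

public class ArrayAccess : Expression {
    readonly Type arrayType;
    public ArrayAccess (Type arrayType) { this.arrayType = arrayType; }

    public override Expression DoResolve ()
    {
        eclass = ExprClass.Variable;
        type = arrayType.GetElementType ();
        return this;
    }
}

public class IndexerAccess : Expression {
    readonly PropertyInfo indexer;
    public IndexerAccess (PropertyInfo indexer) { this.indexer = indexer; }

    public override Expression DoResolve ()
    {
        eclass = ExprClass.IndexerAccess;
        type = indexer.PropertyType;
        return this;
    }
}

// What the parser creates for `a [0]' before the meaning is known.
public class ElementAccess : Expression {
    readonly Type targetType;
    public ElementAccess (Type targetType) { this.targetType = targetType; }

    public override Expression DoResolve ()
    {
        if (targetType.IsArray)
            return new ArrayAccess (targetType).DoResolve ();
        foreach (PropertyInfo p in targetType.GetProperties ())
            if (p.GetIndexParameters ().Length > 0)
                return new IndexerAccess (p).DoResolve ();
        return null;    // a real compiler would report an error here
    }
}
```

- Callers always continue with the value returned by DoResolve
- rather than with the node they started from, which is what
- lets ElementAccess disappear from the tree after resolution.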
- *** The Expression Class
- The utility functions that can be called by all children of
- Expression.
- ** Constants
- Constants in the Mono C# compiler are represented by the
- abstract class `Constant'. Constant is in turn derived from
- Expression. The base constructor for `Constant' just sets the
- expression class to be an `ExprClass.Value'. Constants are
- born in a fully resolved state, so the `DoResolve' method
- only returns a reference to itself.
- Each Constant should implement the `GetValue' method, which
- returns an object with the actual contents of this constant.
- A utility virtual method called `AsString' is used to render
- a diagnostic message. The output of AsString is shown to the
- developer when an error or a warning is triggered.
- Constant classes also participate in the constant folding
- process. Expressions that can be constant folded do so by
- invoking the functionality provided by the ConstantFold
- class (cfold.cs).
- Each Constant has to implement a number of methods to convert
- itself into a Constant of a different type. These methods are
- called `ConvertToXXXX' and they are invoked by the wrapper
- functions `ToXXXX'. These methods only perform implicit
- numeric conversions. Explicit conversions are handled by the
- `Cast' expression class.
- The `ToXXXX' methods are the entry point, and provide error
- reporting in case a conversion cannot be performed.
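- The split between the two families of methods could look
- roughly like the sketch below. It covers only three constant
- kinds and two target types, Constant does not derive from
- Expression here, and the error counter is a stand-in for the
- compiler's real reporting. Treat the concrete signatures as
- assumptions of the example.

```csharp
using System;

public abstract class Constant {
    public static int Errors;    // stand-in for the compiler's error count

    public abstract object GetValue ();

    public virtual string AsString ()
    {
        return GetValue ().ToString ();
    }

    // ConvertToXXXX: return null when no implicit numeric conversion exists.
    public virtual DoubleConstant ConvertToDouble () { return null; }
    public virtual LongConstant ConvertToLong () { return null; }

    // ToXXXX: entry points that add the error reporting.
    public DoubleConstant ToDouble ()
    {
        DoubleConstant c = ConvertToDouble ();
        if (c == null)
            ReportError ("double");
        return c;
    }

    public LongConstant ToLong ()
    {
        LongConstant c = ConvertToLong ();
        if (c == null)
            ReportError ("long");
        return c;
    }

    void ReportError (string target)
    {
        Errors++;
        Console.Error.WriteLine ("cannot implicitly convert `{0}' to {1}", AsString (), target);
    }
}

public class IntConstant : Constant {
    public readonly int Value;
    public IntConstant (int v) { Value = v; }
    public override object GetValue () { return Value; }
    public override DoubleConstant ConvertToDouble () { return new DoubleConstant (Value); }
    public override LongConstant ConvertToLong () { return new LongConstant (Value); }
}

public class LongConstant : Constant {
    public readonly long Value;
    public LongConstant (long v) { Value = v; }
    public override object GetValue () { return Value; }
    public override DoubleConstant ConvertToDouble () { return new DoubleConstant (Value); }
    public override LongConstant ConvertToLong () { return this; }
}

public class DoubleConstant : Constant {
    public readonly double Value;
    public DoubleConstant (double v) { Value = v; }
    public override object GetValue () { return Value; }
    public override DoubleConstant ConvertToDouble () { return this; }
    // double to long is an explicit conversion, so ConvertToLong stays null.
}
```

- In this arrangement a new constant type only overrides the
- conversions that are implicit for it, and everything else
- falls through to the error path in the base class.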
- ** Constant Folding
- The C# language requires constant folding to be implemented.
- Constant folding is hooked up in the Binary.Resolve method.
- If both sides of a binary expression are constants, then the
- ConstantFold.BinaryFold routine is invoked.
- This routine implements all the binary operator rules. It
- mirrors the code that generates code for binary operators,
- except that the generated code is evaluated at runtime while
- the folder computes its result at compile time.
- If the constants can be folded, then a new constant expression
- is returned. If not, the null value is returned (for example,
- the concatenation of a string constant and a numeric constant
- is deferred to the runtime).
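- A toy version of that contract is sketched below: fold when
- both operands have a supported constant type, and return null
- otherwise so that the caller emits ordinary runtime code. The
- Operator enum, the use of boxed values instead of Constant
- objects and the name MiniConstantFold are assumptions made to
- keep the example short.

```csharp
using System;

public enum Operator { Addition, Subtraction, Multiply }

public static class MiniConstantFold {
    // Returns the folded value, or null when the operation has to be
    // left to the runtime.
    public static object BinaryFold (Operator op, object left, object right)
    {
        if (left is int && right is int) {
            int l = (int) left, r = (int) right;
            // Overflow is checked here; a real compiler would turn the
            // resulting OverflowException into a reported error.
            checked {
                switch (op) {
                case Operator.Addition: return l + r;
                case Operator.Subtraction: return l - r;
                case Operator.Multiply: return l * r;
                }
            }
        }
        if (op == Operator.Addition && left is string && right is string)
            return (string) left + (string) right;

        // Mixed cases such as string + int are not folded.
        return null;
    }
}
```

- Returning null instead of raising an error keeps the folder
- optional: an operation that is not folded is still compiled,
- just without the compile-time shortcut.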
- ** Side effects
- a [i++]++
- a [i++] += 5;
- ** Statements
- * The semantic analysis
- Because C# allows declarations to appear after their use, the
- compiler driver has to parse all the input files first.
- Once all the input files have been parsed, and an internal
- representation of the input program exists, the following
- steps are taken:
- * The interface hierarchy is resolved first.
- As the interface hierarchy is constructed,
- TypeBuilder objects are created for each one of
- them.
- * The class and structure hierarchy is resolved next;
- TypeBuilder objects are created for them.
- * Constants and enumerations are resolved.
- * Method, indexer, property, delegate and event
- definitions are now entered into the TypeBuilders.
- * Elements that contain code are now invoked to
- perform semantic analysis and code generation.
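- The ordering of these steps can be pictured as a list of
- phases that run strictly one after another, stopping at the
- first phase that reports errors. The sketch below only models
- that control flow; the Phase and MiniDriver names and the
- convention that a phase returns its error count are
- assumptions of the example, not the structure of driver.cs.

```csharp
using System;
using System.Collections.Generic;

public class Phase {
    public readonly string Name;
    public readonly Func<int> Run;    // returns the number of errors reported

    public Phase (string name, Func<int> run)
    {
        Name = name;
        Run = run;
    }
}

public static class MiniDriver {
    // Runs the phases in order and returns the names of those that
    // completed without errors. Later phases are skipped after a failure,
    // so each phase can assume its input is consistent.
    public static List<string> RunAll (IList<Phase> phases)
    {
        var completed = new List<string> ();
        foreach (Phase p in phases) {
            if (p.Run () > 0)
                break;
            completed.Add (p.Name);
        }
        return completed;
    }
}
```

- A caller would register the phases in the order listed above:
- interfaces, classes and structs, constants and enumerations,
- member definitions, and finally semantic analysis with code
- generation.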
- * Output Generation
- ** Code Generation
- The EmitContext class is created any time that IL code is to
- be generated (methods, properties, indexers and attributes all
- create EmitContexts).
- The EmitContext keeps track of the current namespace and type
- container. This is used during name resolution.
- An EmitContext is used by the underlying code generation
- facilities to track the state of code generation:
- * The ILGenerator used to generate code for this
- method.
- * The TypeContainer where the code lives, this is used
- to access the TypeBuilder.
- * The DeclSpace, this is used to resolve names through
- RootContext.LookupType in the various statements and
- expressions.
-
- Code generation state is also tracked here:
- * CheckState:
- This variable tracks the `checked' state of the
- compilation, it controls whether we should generate
- code that does overflow checking, or if we generate
- code that ignores overflows.
-
- The default setting comes from the command line
- option to generate checked or unchecked code plus
- any source code changes using the checked/unchecked
- statements or expressions. Contrast this with the
- ConstantCheckState flag.
- * ConstantCheckState
-
- The constant check state is always set to `true' and
- cannot be changed from the command line. The source
- code can change this setting with the `checked' and
- `unchecked' statements and expressions.
-
- * IsStatic
-
- Whether we are emitting code inside a static or
- instance method.
-
- * ReturnType
-
- The value that is allowed to be returned or NULL if
- there is no return type.
-
-
- * ContainerType
-
- Points to the Type (extracted from the
- TypeContainer) that declares this body of code
-
-
- * IsConstructor
-
- Whether this is generating code for a constructor
- * CurrentBlock
- Tracks the current block being generated.
- * ReturnLabel
-
- The location where return has to jump to return the
- value.
- A few variables are used to track the state needed for checks
- in for loops, or in try/catch statements:
- * InFinally
-
- Whether we are in a Finally block
- * InTry
- Whether we are in a Try block
- * InCatch
-
- Whether we are in a Catch block
- * InUnsafe
- Whether we are inside an unsafe block
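- To show how such state can influence emission, the sketch
- below models a cut-down context object whose CheckState picks
- between the overflow-checking and the plain IL opcode for a
- signed integer addition. MiniEmitContext and WithCheckState
- are hypothetical names; only the field names are taken from
- the lists above, and the real EmitContext carries much more.

```csharp
using System;
using System.Reflection.Emit;

public class MiniEmitContext {
    public bool CheckState;                    // from the command line, then source
    public bool ConstantCheckState = true;     // always starts out as true
    public bool IsStatic;
    public bool IsConstructor;
    public bool InTry, InCatch, InFinally, InUnsafe;

    public MiniEmitContext (bool checkedByDefault)
    {
        CheckState = checkedByDefault;
    }

    // Opcode for a signed integer addition in the current check state.
    public OpCode AddOpCode {
        get { return CheckState ? OpCodes.Add_Ovf : OpCodes.Add; }
    }

    // Runs `body' with CheckState temporarily replaced, the way a
    // checked or unchecked block would, and restores it afterwards.
    public void WithCheckState (bool state, Action body)
    {
        bool old = CheckState;
        CheckState = state;
        try {
            body ();
        } finally {
            CheckState = old;
        }
    }
}
```

- Saving and restoring the previous value is what lets nested
- checked and unchecked regions behave correctly without any
- extra bookkeeping in the statements themselves.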
-
- * Miscellaneous
- ** Error Processing.
- Errors are reported during the various stages of the
- compilation process. The compiler stops its processing if
- there are errors between the various phases. This simplifies
- the code, because it is always safe to assume that the data
- structures that the compiler is operating on are
- consistent.
- The error codes in the Mono C# compiler are the same as those
- found in the Microsoft C# compiler, with a few exceptions
- (where we report a few more errors; those are documented in
- mcs/errors/errors.txt). The goal is to reduce confusion for
- users, and also to help us track the progress of the
- compiler in terms of the errors we report.
- The Report class provides error and warning display functions,
- and also keeps an error count which is used to stop the
- compiler between the phases.
- A couple of debugging tools are available here, and are useful
- when extending or fixing bugs in the compiler. If the
- `--fatal' flag is passed to the compiler, the Report.Error
- routine will throw an exception. This can be used to pinpoint
- the location of the bug and examine the variables around the
- error location.
- Warnings can be turned into errors by using the `--werror'
- flag to the compiler.
- The Report class also ignores warnings that have been
- specified on the command line with the `--nowarn' flag.
- Finally, code in the compiler uses the global variable
- RootContext.WarningLevel in a few places to decide whether a
- warning is worth reporting to the user or not.
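- The behaviour described in this section can be summarized in
- a few lines of code. The following is a minimal sketch under
- stated assumptions: MiniReport is an instance class invented
- for the example, the three boolean and set members model the
- `--fatal', `--werror' and `--nowarn' flags, and the message
- format is only illustrative.

```csharp
using System;
using System.Collections.Generic;

public class MiniReport {
    public int Errors;
    public int Warnings;
    public bool Fatal;                 // models --fatal
    public bool WarningsAreErrors;     // models --werror
    readonly HashSet<int> ignored = new HashSet<int> ();    // models --nowarn

    public void SetIgnoreWarning (int code)
    {
        ignored.Add (code);
    }

    public void Error (int code, string text)
    {
        Console.Error.WriteLine ("error CS{0:0000}: {1}", code, text);
        Errors++;
        // Throwing here gives a stack trace that points at the code
        // path in the compiler that produced the error.
        if (Fatal)
            throw new Exception (text);
    }

    public void Warning (int code, string text)
    {
        if (ignored.Contains (code))
            return;
        if (WarningsAreErrors) {
            Error (code, text);
            return;
        }
        Console.Error.WriteLine ("warning CS{0:0000}: {1}", code, text);
        Warnings++;
    }
}
```

- A driver would then compare the error count against zero
- between phases and stop as soon as it is positive.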