doc.odin 6.2 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182
  1. /*
  2. A threaded virtual machine for interpreting regular expressions.
  3. Based on the designs described by Russ Cox and attributed to both Ken Thompson and Rob Pike.
  4. The virtual machine executes all threads in lock step, i.e. the string pointer
  5. does not advance until all threads have finished processing the current rune.
  6. The algorithm does not look backwards.
  7. Threads merge when splitting or jumping to positions already visited by another
  8. thread, based on the observation that each thread having visited one PC
  9. (Program Counter) state will execute identically to the previous thread.
  10. Each thread keeps a save state of its capture groups, and thread priority is
  11. used to allow higher precedence operations to complete first with correct save
  12. states, such as greedy versus non-greedy repetition.
  13. For more information, see: https://swtch.com/~rsc/regexp/regexp2.html
  14. **Implementation Details:**
  15. - Each opcode is 8 bits in size, and most instructions have no operands.
  16. - All operands larger than `u8` are read in system endian order.
  17. - Jump and Split instructions operate on absolute positions in `u16` operands.
  18. - Classes such as `[0-9]` are stored in a RegEx-specific slice of structs which
  19. are then dereferenced by a `u8` index from the `Rune_Class` instructions.
  20. - Each Byte and Rune opcode have their operands stored inline after the opcode,
  21. sized `u8` and `i32` respectively.
  22. - A bitmap is used to determine which PC positions are occupied by a thread to
  23. perform merging. The bitmap is cleared with every new frame.
  24. - The VM supports two modes: ASCII and Unicode, decided by a compile-time
  25. boolean constant argument provided to `run`. The procedure differs only in
  26. string decoding. This was done for the sake of performance.
  27. - No allocations are ever freed; the VM expects an arena or temporary allocator
  28. to be used in the context preceding it.
  29. **Opcode Reference:**
  30. (0x00) Match
  31. The terminal opcode which ends a thread. This always comes at the end of
  32. the program.
  33. (0x01) Match_And_Exit
  34. A modified version of Match which stops the virtual machine entirely. It is
  35. only compiled for `No_Capture` expressions, as those expressions do not
  36. need to determine which thread may have saved the most appropriate capture
  37. groups.
  38. (0x02) Byte
  39. Consumes one byte from the text using its operand, which is also a byte.
  40. (0x03) Rune
  41. Consumes one Unicode codepoint from the text using its operand, which is
  42. four bytes long in a system-dependent endian order.
  43. (0x04) Rune_Class
  44. Consumes one character (which may be an ASCII byte or Unicode codepoint,
  45. wholly dependent on which mode the virtual machine is running in) from the
  46. text.
  47. The actual data storing what runes and ranges of runes apply to the class
  48. are stored alongside the program in the Regular_Expression structure and
  49. the operand for this opcode is a single byte which indexes into a
  50. collection of these data structures.
  51. (0x05) Rune_Class_Negated
  52. A modified version of Rune_Class that functions the same, save for how it
  53. returns the opposite of what Rune_Class matches.
  54. (0x06) Wildcard
  55. Consumes one byte or one Unicode codepoint, depending on the VM mode.
  56. (0x07) Jump
  57. Sets the Program Counter of a VM thread to the operand, which is a u16.
  58. This opcode is used to implement Alternation (coming at the end of the left
  59. choice) and Repeat_Zero (to cause the thread to loop backwards).
  60. (0x08) Split
  61. Spawns a new thread for the X operand and causes the current thread to jump
  62. to the Y operand. This opcode is used to implement Alternation, all the
  63. Repeat variations, and the Optional nodes.
  64. Splitting threads is how the virtual machine is able to execute optional
  65. control flow paths, letting it evaluate different possible ways to match
  66. text.
  67. (0x09) Save
  68. Saves the current string index to a slot on the thread dictated by the
  69. operand. These values will be used later to reconstruct capture groups.
  70. (0x0A) Assert_Start
  71. Asserts that the thread is at the beginning of the string.
  72. (0x0B) Assert_Start_Multiline
  73. This opcode is compiled in only when the `Multiline` flag is present as a
  74. replacement for the `^` text anchor.
  75. Asserts that the thread is at the beginning of the string or previously
  76. parsed either a "\n" or "\r".
  77. (0x0C) Assert_End
  78. Asserts that the thread is at the end of the string.
  79. (0x0D) Assert_Word_Boundary
  80. Asserts that the thread is on a word boundary, which can be the start or
  81. end of the text. This examines both the current rune and the next rune.
  82. (0x0E) Assert_Non_Word_Boundary
  83. A modified version of Assert_Word_Boundary that returns the opposite value.
  84. (0x0F) Multiline_Open
  85. This opcode is compiled in only when the `Multiline` flag is present as a
  86. replacement for the `$` text anchor.
  87. It asserts that either the current thread is at the end of the string,
  88. or it consumes a `\n` or `\r` character.
  89. If a `\r` character is consumed, the PC will be advanced to the sibling
  90. `Multiline_Close` opcode to optionally consume a `\n` character on the next
  91. frame.
  92. (0x10) Multiline_Close
  93. This opcode is always present after `Multiline_Open`.
  94. It handles consuming the second half of a complete newline, if necessary.
  95. For example, Windows newlines are represented by the characters `\r\n`,
  96. whereas UNIX newlines are `\n` and Macintosh newlines are `\r`.
  97. (0x11) Wait_For_Byte
  98. (0x12) Wait_For_Rune
  99. (0x13) Wait_For_Rune_Class
  100. (0x14) Wait_For_Rune_Class_Negated
  101. These opcodes are an optimization around restarting threads on failed
  102. matches when the beginning to a pattern is predictable and the Global flag
  103. is set.
  104. They will cause the VM to wait for the next rune to match before splitting,
  105. as would happen in the un-optimized version.
  106. (0x15) Match_All_And_Escape
  107. This opcode is an optimized version of `.*$` or `.+$` that causes the
  108. active thread to immediately work on escaping the program by following all
  109. Jumps out to the end.
  110. While running through the rest of the program, the thread will trigger on
  111. every Save instruction it passes to store the length of the string.
  112. This way, any time a program hits one of these `.*$` constructs, the
  113. virtual machine can exit early, vastly improving processing times.
  114. Be aware, this opcode is not compiled in if the `Multiline` flag is on, as
  115. the meaning of `$` changes with that flag.
  116. */
  117. package regex_vm