12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697 |
- /*
- package regex implements a complete suite for using Regular Expressions to
- match and capture text.
- Regular expressions are used to describe how a piece of text can match to
- another, using a pattern language.
- Odin's regex library implements the following features:
- Alternation: `apple|cherry`
- Classes: `[0-9_]`
- Classes, negated: `[^0-9_]`
- Shorthands: `\d\s\w`
- Shorthands, negated: `\D\S\W`
- Wildcards: `.`
- Repeat, optional: `a*`
- Repeat, at least once: `a+`
- Repetition: `a{1,2}`
- Optional: `a?`
- Group, capture: `([0-9])`
- Group, non-capture: `(?:[0-9])`
- Start & End Anchors: `^hello$`
- Word Boundaries: `\bhello\b`
- Non-Word Boundaries: `hello\B`
- These specifiers can be composed together, such as an optional group:
- `(?:hello)?`
- This package also supports the non-greedy variants of the repeating and
- optional specifiers by appending a `?` to them.
- Of the shorthand classes that are supported, they are all ASCII-based, even
- when compiling in Unicode mode. This is for the sake of general performance and
- simplicity, as there are thousands of Unicode codepoints which would qualify as
- either a digit, space, or word character which could be irrelevant depending on
- what is being matched.
- Here are the shorthand class equivalencies:
- \d: [0-9]
- \s: [\t\n\f\r ]
- \w: [0-9A-Z_a-z]
- If you need your own shorthands, you can compose strings together like so:
- MY_HEX :: "[0-9A-Fa-f]"
- PATTERN :: MY_HEX + "-" + MY_HEX
- The compiler will handle turning multiple identical classes into references to
- the same set of matching runes, so there's no penalty for doing it like this.
- ``Some people, when confronted with a problem, think
- "I know, I'll use regular expressions." Now they have two problems.''
- - Jamie Zawinski
- Regular expressions have gathered a reputation over the decades for often being
- chosen as the wrong tool for the job. Here, we will clarify a few cases in
- which RegEx might be good or bad.
- **When is it a good time to use RegEx?**
- - You don't know at compile-time what patterns of text the program will need to
- match when it's running.
- - As an example, you are making a client which can be configured by the user to
- trigger on certain text patterns received from a server.
- - For another example, you need a way for users of a text editor to compose
- matching strings that are more intricate than a simple substring lookup.
- - The text you're matching against is small (< 64 KiB) and your patterns aren't
- overly complicated with branches (alternations, repeats, and optionals).
- - If none of the above general impressions apply but your project doesn't
- warrant long-term maintenance.
- **When is it a bad time to use RegEx?**
- - You know at compile-time the grammar you're parsing; a hand-made parser has
- the potential to be more maintainable and readable.
- - The grammar you're parsing has certain validation steps that lend itself to
- forming complicated expressions, such as e-mail addresses, URIs, dates,
- postal codes, credit cards, et cetera. Using RegEx to validate these
- structures is almost always a bad sign.
- - The text you're matching against is big (> 1 MiB); you would be better served
- by first dividing the text into manageable chunks and using some heuristic to
- locate the most likely location of a match before applying RegEx against it.
- - You value high performance and low memory usage; RegEx will always have a
- certain overhead which increases with the complexity of the pattern.
- The implementation of this package has been optimized, but it will never be as
- thoroughly performant as a hand-made parser. In comparison, there are just too
- many intermediate steps, assumptions, and generalizations in what it takes to
- handle a regular expression.
- */
- package regex
|