- .TH PCRE2COMPAT 3 "30 November 2023" "PCRE2 10.43"
- .SH NAME
- PCRE2 - Perl-compatible regular expressions (revised API)
- .SH "DIFFERENCES BETWEEN PCRE2 AND PERL"
- .rs
- .sp
- This document describes some of the known differences in the ways that PCRE2
- and Perl handle regular expressions. The differences described here are with
- respect to Perl version 5.38.0, but as both Perl and PCRE2 are continually
- changing, the information may at times be out of date.
- .P
- 1. When PCRE2_DOTALL (equivalent to Perl's /s qualifier) is not set, the
- behaviour of the '.' metacharacter differs from Perl. In PCRE2, '.' matches the
- next character unless it is the start of a newline sequence. This means that,
- if the newline setting is CR, CRLF, or NUL, '.' will match the code point LF
- (0x0A) in ASCII/Unicode environments, and NL (either 0x15 or 0x25) when using
- EBCDIC. In Perl, '.' appears never to match LF, even when 0x0A is not a newline
- indicator.
- .P
- 2. PCRE2 has only a subset of Perl's Unicode support. Details of what it does
- have are given in the
- .\" HREF
- \fBpcre2unicode\fP
- .\"
- page.
- .P
- 3. Like Perl, PCRE2 allows repeat quantifiers on parenthesized assertions, but
- they do not mean what you might think. For example, (?!a){3} does not assert
- that the next three characters are not "a". It just asserts that the next
- character is not "a" three times (in principle; PCRE2 optimizes this to run the
- assertion just once). Perl allows some repeat quantifiers on other assertions,
- for example, \eb* , but these do not seem to have any use. PCRE2 does not allow
- any kind of quantifier on non-lookaround assertions.
- .P
- 4. If a braced quantifier such as {1,2} appears where there is nothing to
- repeat (for example, at the start of a branch), PCRE2 raises an error whereas
- Perl treats the quantifier characters as literal.
- .P
- 5. Capture groups that occur inside negative lookaround assertions are counted,
- but their entries in the offsets vector are set only when a negative assertion
- is a condition that has a matching branch (that is, the condition is false).
- Perl may set such capture groups in other circumstances.
- .P
- 6. The following Perl escape sequences are not supported: \eF, \el, \eL, \eu,
- \eU, and \eN when followed by a character name. \eN on its own, matching a
- non-newline character, and \eN{U+dd..}, matching a Unicode code point, are
- supported. The escapes that modify the case of following letters are
- implemented by Perl's general string-handling and are not part of its pattern
- matching engine. If any of these are encountered by PCRE2, an error is
- generated by default. However, if either of the PCRE2_ALT_BSUX or
- PCRE2_EXTRA_ALT_BSUX options is set, \eU and \eu are interpreted as ECMAScript
- interprets them.
- .P
- 7. The Perl escape sequences \ep, \eP, and \eX are supported only if PCRE2 is
- built with Unicode support (the default). The properties that can be tested
- with \ep and \eP are limited to the general category properties such as Lu and
- Nd, the derived properties Any and LC (synonym L&), script names such as Greek
- or Han, Bidi_Class, Bidi_Control, and a few binary properties. Both PCRE2 and
- Perl support the Cs (surrogate) property, but in PCRE2 its use is limited. See
- the
- .\" HREF
- \fBpcre2pattern\fP
- .\"
- documentation for details. The long synonyms for property names that Perl
- supports (such as \ep{Letter}) are not supported by PCRE2, nor is it permitted
- to prefix any of these properties with "Is".
- .P
- 8. PCRE2 supports the \eQ...\eE escape for quoting substrings. Characters
- in between are treated as literals. However, this is slightly different from
- Perl in that $ and @ are also handled as literals inside the quotes. In Perl,
- they cause variable interpolation (PCRE2 does not have variables). Also, Perl
- does "double-quotish backslash interpolation" on any backslashes between \eQ
- and \eE which, its documentation says, "may lead to confusing results". PCRE2
- treats a backslash between \eQ and \eE just like any other character. Note the
- following examples:
- .sp
-   Pattern            PCRE2 matches     Perl matches
- .sp
- .\" JOIN
-   \eQabc$xyz\eE        abc$xyz           abc followed by the
-                                            contents of $xyz
-   \eQabc\e$xyz\eE       abc\e$xyz          abc\e$xyz
-   \eQabc\eE\e$\eQxyz\eE   abc$xyz           abc$xyz
-   \eQA\eB\eE            A\eB               A\eB
-   \eQ\e\eE              \e                \e\eE
- .sp
- The \eQ...\eE sequence is recognized both inside and outside character classes
- by both PCRE2 and Perl.
- .P
- 9. Fairly obviously, PCRE2 does not support the (?{code}) and (??{code})
- constructions. However, PCRE2 does have a "callout" feature, which allows an
- external function to be called during pattern matching. See the
- .\" HREF
- \fBpcre2callout\fP
- .\"
- documentation for details.
- .P
- 10. Subroutine calls (whether recursive or not) were treated as atomic groups
- up to PCRE2 release 10.23, but from release 10.30 this changed, and
- backtracking into subroutine calls is now supported, as in Perl.
- .P
- 11. In PCRE2, if any of the backtracking control verbs are used in a group that
- is called as a subroutine (whether or not recursively), their effect is
- confined to that group; it does not extend to the surrounding pattern. This is
- not always the case in Perl. In particular, if (*THEN) is present in a group
- that is called as a subroutine, its action is limited to that group, even if
- the group does not contain any | characters. Note that such groups are
- processed as anchored at the point where they are tested.
- .P
- 12. If a pattern contains more than one backtracking control verb, the first
- one that is backtracked onto acts. For example, in the pattern
- A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure in C
- triggers (*PRUNE). Perl's behaviour is more complex; in many cases it is the
- same as PCRE2, but there are cases where it differs.
- .P
- 13. There are some differences that are concerned with the settings of captured
- strings when part of a pattern is repeated. For example, matching "aba" against
- the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE2 it is set to
- "b".
- .P
- 14. PCRE2's handling of duplicate capture group numbers and names is not as
- general as Perl's. This is a consequence of the fact that PCRE2 works internally
- just with numbers, using an external table to translate between numbers and
- names. In particular, a pattern such as (?|(?<a>A)|(?<b>B)), where the two
- capture groups have the same number but different names, is not supported, and
- causes an error at compile time. If it were allowed, it would not be possible
- to distinguish which group matched, because both names map to capture group
- number 1.
- .P
- 15. Perl used to recognize comments in some places that PCRE2 does not, for
- example, between the ( and ? at the start of a group. With the /x modifier
- set, Perl also used to allow white space between ( and ?, but this was first
- deprecated and now gives an error in recent releases. There may still be some
- cases where Perl behaves differently.
- .P
- 16. Perl, when in warning mode, gives warnings for character classes such as
- [A-\ed] or [a-[:digit:]]. It then treats the hyphens as literals. PCRE2 has no
- warning features, so it gives an error in these cases because they are almost
- certainly user mistakes.
- .P
- 17. In PCRE2, the upper/lower case character properties Lu and Ll are not
- affected when case-independent matching is specified. For example, \ep{Lu}
- always matches an upper case letter. I think Perl has changed in this respect;
- in the release at the time of writing (5.38), \ep{Lu} and \ep{Ll} match all
- letters, regardless of case, when case independence is specified.
- .P
- 18. From release 5.32.0, Perl locks out the use of \eK in lookaround
- assertions. From release 10.38 PCRE2 does the same by default. However, there
- is an option for re-enabling the previous behaviour. When this option is set,
- \eK is acted on when it occurs in positive assertions, but is ignored in
- negative assertions.
- .P
- 19. PCRE2 provides some extensions to the Perl regular expression facilities.
- Perl 5.10 included new features that were not in earlier versions of Perl, some
- of which (such as named parentheses) were in PCRE2 for some time before. This
- list is with respect to Perl 5.38:
- .sp
- (a) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the $
- meta-character matches only at the very end of the string.
- .sp
- (b) A backslash followed by a letter with no special meaning is faulted. (Perl
- can be made to issue a warning.)
- .sp
- (c) If PCRE2_UNGREEDY is set, the greediness of the repetition quantifiers is
- inverted, that is, by default they are not greedy, but if followed by a
- question mark they are.
- .sp
- (d) PCRE2_ANCHORED can be used at matching time to force a pattern to be tried
- only at the first matching position in the subject string.
- .sp
- (e) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY and PCRE2_NOTEMPTY_ATSTART
- options have no Perl equivalents.
- .sp
- (f) The \eR escape sequence can be restricted to match only CR, LF, or CRLF
- by the PCRE2_BSR_ANYCRLF option.
- .sp
- (g) The callout facility is PCRE2-specific. Perl supports codeblocks and
- variable interpolation, but not general hooks on every match.
- .sp
- (h) The partial matching facility is PCRE2-specific.
- .sp
- (i) The alternative matching function (\fBpcre2_dfa_match()\fP) matches in a
- different way and is not Perl-compatible.
- .sp
- (j) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT) at
- the start of a pattern. These set overall options that cannot be changed within
- the pattern.
- .sp
- (k) PCRE2 supports non-atomic positive lookaround assertions. This is an
- extension to the lookaround facilities. The default, Perl-compatible
- lookarounds are atomic.
- .sp
- (l) There are three syntactical items in patterns that can refer to a capturing
- group by number: back references such as \eg{2}, subroutine calls such as (?3),
- and condition references such as (?(4)...). PCRE2 supports relative group
- numbers such as +2 and -4 in all three cases. Perl supports both plus and minus
- for subroutine calls, but only minus for back references, and no relative
- numbering at all for conditions.
- .P
- 20. Perl has different limits than PCRE2. See the
- .\" HREF
- \fBpcre2limit\fP
- .\"
- documentation for details. With release 5.10, Perl changed from recursion to
- iteration, keeping the intermediate matches on the heap, which is about 10%
- slower but is not subject to any stack-overflow limit. PCRE2 made a similar
- change at release 10.30, and also has many build-time and run-time
- customizable limits.
- .P
- 21. Unlike Perl, PCRE2 does not have character set modifiers and, in
- particular, has no way to choose character semantics by context as Perl's "/d"
- does. A regular expression using PCRE2_UTF and PCRE2_UCP will use similar
- rules to Perl's "/u"; something closer to "/a" can be selected by adding the
- PCRE2_EXTRA_ASCII* options on top.
- .P
- 22. Some recursive patterns that Perl diagnoses as infinite recursions can be
- handled by PCRE2, either by the interpreter or the JIT. An example is
- /(?:|(?0)abcd)(?(R)|\ez)/, which matches a sequence of any number of repeated
- "abcd" substrings at the end of the subject.
- .
- .
- .SH AUTHOR
- .rs
- .sp
- .nf
- Philip Hazel
- Retired from University Computing Service
- Cambridge, England.
- .fi
- .
- .
- .SH REVISION
- .rs
- .sp
- .nf
- Last updated: 30 November 2023
- Copyright (c) 1997-2023 University of Cambridge.
- .fi