@carenas carenas commented Dec 27, 2023

Change -P so that, by default, \d matches only the ASCII digits in an expression, as that performs better and is most likely what the user wanted.

For consistency, do the same for [:digit:], so that the recommendation for Unicode handling in regular expressions (TR18) is followed more closely (considering that both \d and [:digit:] would then match the POSIX definition of digits).
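
To illustrate the difference in matching semantics, here is a minimal sketch. It uses Python's `re` module rather than PCRE2 or `git grep`, so it only demonstrates the concept, not the code in this change. The sample string is made up for illustration, and `re.ASCII` stands in for the ASCII-only behaviour proposed here.

```python
import re

# Hypothetical sample: ASCII digits, Arabic-Indic digits (U+0664, U+0662)
# and fullwidth digits (U+FF14, U+FF12), separated by spaces.
text = "42 \u0664\u0662 \uff14\uff12"

# Unicode-aware matching: \d matches any Unicode decimal digit.
unicode_digits = re.findall(r"\d+", text)

# ASCII-only matching: \d is restricted to [0-9].
ascii_digits = re.findall(r"\d+", text, flags=re.ASCII)

print(len(unicode_digits), "runs matched with Unicode-aware \\d")
print(len(ascii_digits), "run matched with ASCII-only \\d")
```

With Unicode-aware matching all three runs of digits are found, while the ASCII-only variant finds just `42`, which is the behaviour this change makes the default for -P.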

Since acabd20 (grep: correctly identify utf-8 characters with
\{b,w} in -P, 2023-01-08), PCRE2's UCP mode has been enabled when
UTF content was expected and therefore multibyte characters are
included as part of all character classes (when defined).

Note that if the locale in use is not UTF enabled (notably C or POSIX),
or uses an extended charmap that is not Unicode compatible, binary
matching will be used instead.

It was argued that doing so, at least for \d, was not a good idea,
as that might not be what the user expected based on its historical
meaning and was also slower, and indeed a similar change that was done
to GNU grep was reverted and required further tweaks.

At that time, PCRE2 didn't have a way to disable UCP's character
expansion selectively, but flags to do so have since been added and
will be available in the next release (planned soon). One of them
has been in use by GNU grep since their last release in May (only
if built and linked against the prereleased PCRE2 library though).

Add flags to make sure that both \d and [:digit:] only include
ASCII digits so that `git grep` will be closer to GNU grep and
improve performance as a side effect, but add a configuration flag
to allow keeping the current behaviour (which is closer to perl
and ripgrep).

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>