xr -- Emacs regexp parser and analyser
======================================
XR converts Emacs regular expressions to the structured rx form, thus
being an inverse of rx. It can also find mistakes and questionable
constructs inside regexp strings.
It can be useful for:
- Migrating existing code to rx form
- Understanding what a regexp string really means
- Finding errors in regexp strings
It can also parse and find mistakes in skip-sets, the regexp-like
arguments to skip-chars-forward and skip-chars-backward.
The xr package can be used interactively or by other code as a library.
* Example
(xr-pp "\\`\\(?:[^^]\\|\\^\\(?: \\*\\|\\[\\)\\)")
outputs
(seq bos
     (or (not (any "^"))
         (seq "^"
              (or " *" "["))))
* Installation
From GNU ELPA (https://elpa.gnu.org/packages/xr.html):
M-x package-install RET xr RET
* Interface
Functions parsing regexp strings:
xr -- convert regexp to rx
xr-pp -- convert regexp to rx and pretty-print
xr-lint -- find mistakes in regexp
Functions parsing skip sets:
xr-skip-set -- convert skip-set to rx
xr-skip-set-pp -- convert skip-set to rx and pretty-print
xr-skip-set-lint -- find mistakes in skip-set
Utility:
xr-pp-rx-to-str -- pretty-print rx expression to string
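A rough sketch of calling these functions from Lisp code follows. It assumes the xr package is installed and is skipped otherwise; the exact shape of the returned diagnostics is not specified here and may differ between xr versions, so the results are only printed.

```elisp
;; Sketch only: the body runs when the xr package can be loaded.
(when (require 'xr nil t)
  ;; Regexp string -> rx form.
  (message "%S" (xr "a\\|b"))
  ;; Diagnostics for a regexp string, one list entry per finding.
  (message "%S" (xr-lint "[a-z-A-Z]"))
  ;; Diagnostics for a skip set, as passed to skip-chars-forward.
  (message "%S" (xr-skip-set-lint "[a-z]")))
```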
* What the diagnostics mean
- Unescaped literal 'X'
A special character is taken literally because it occurs in a
position where it does not need to be backslash-escaped. It is
good style to escape it anyway (assuming that it should occur as a
literal character).
- Escaped non-special character 'X'
A character is backslash-escaped even though this is not necessary
and does not turn it into a special sequence. Maybe the backslash
was in error, or should be doubled if a literal backslash was
expected.
- Duplicated 'X' inside character alternative
A character occurs twice inside [...]; this is obviously
pointless. In particular, backslashes are not special inside
[...]; they have no escaping power, and do not need to be escaped
in order to include a literal backslash.
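The point about backslashes can be checked with the built-in string-match-p alone. This is a minimal sketch and the sample strings are arbitrary:

```elisp
;; Inside [...] a backslash is an ordinary character.
;; The Lisp string "[\\]" is the regexp [\], a set containing only backslash.
(string-match-p "[\\]" "a\\b")     ; => 1, the position of the backslash
;; The Lisp string "[\\\\]" is the regexp [\\]: the same set with a duplicate.
(string-match-p "[\\\\]" "a\\b")   ; => 1
```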
- Repetition of repetition
- Repetition of option
- Optional repetition
- Optional option
A repetition construct is applied to an expression that is already
repeated, such as a*+ or \(x?\)?. These expressions can be written
with a single repetition and often indicate a different mistake,
perhaps a missing backslash.
When a repetition construct is ? or ??, it is termed 'option'
instead; the principle is the same.
- Reversed range 'Y-X' matches nothing
The last character of a range precedes the first and therefore
includes no characters at all (not even the endpoints). Most such
ranges are caused by a misplaced hyphen.
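A small check of the behaviour described above, using only built-in functions:

```elisp
;; A reversed range is empty, so not even its endpoints match.
(string-match-p "[z-a]" "z")   ; => nil
;; The range written the right way round matches as expected.
(string-match-p "[a-z]" "z")   ; => 0
```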
- Character 'B' included in range 'A-C'
- Range 'A-C' includes character 'B'
A range includes a character that also occurs individually. This
is often caused by a misplaced hyphen.
- Ranges 'A-M' and 'D-Z' overlap
Two ranges have at least one character in common. This is often
caused by a misplaced hyphen.
- Two-character range 'A-B'
A range only consists of its two endpoints, since they have
consecutive character codes. This is often caused by a misplaced
hyphen.
- Range 'A-z' between upper and lower case includes symbols
A range spans over upper and lower case letters, which also
includes some symbols. This is probably unintentional. To cover
both upper and lower case letters, use separate ranges, as in
[A-Za-z].
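To see which symbols slip in, consider the underscore, whose character code lies between those of Z and a. A minimal sketch with case folding switched off for clarity:

```elisp
(let ((case-fold-search nil))
  (list (string-match-p "[A-z]" "_")       ; 0: underscore is inside A-z
        (string-match-p "[A-Za-z]" "_")))  ; nil: letters only
;; => (0 nil)
```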
- Suspect character range '+-X': should '-' be literal?
A range has + as one of its endpoints, which could mean that the
hyphen was actually intended to be literal in order to match both
+ and -.
This check is only enabled when the 'checks' argument is 'all'.
- Possibly erroneous '\X' in character alternative
A character alternative includes something that looks like an
escape sequence, but no escape sequences are allowed there since
backslash is not a special character in that context.
It could also be caused by too many backslashes.
For example, "[\\n\\t]" matches the characters 'n', 't' and
backslash, but could be an attempt to match newline and tab.
This check is only enabled when the 'checks' argument is 'all'.
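The difference between "[\\n\\t]" and the probably intended "[\n\t]" can be verified directly. This sketch uses only string-match-p:

```elisp
(list (string-match-p "[\\n\\t]" "n")    ; 0: matches the letter n
      (string-match-p "[\\n\\t]" "\n")   ; nil: does not match a newline
      (string-match-p "[\n\t]" "\n"))    ; 0: single backslashes give newline and tab
;; => (0 nil 0)
```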
- Duplicated character class '[:class:]'
A character class occurs twice in a single character alternative
or skip set.
- Or-pattern more efficiently expressed as character alternative
When an or-pattern can be written as a character alternative, it
becomes more efficient and reduces regexp stack usage.
For example, a\|b is better written [ab], and \s-\|\sw is usually
better written [[:space:][:word:]]. (There is a subtle difference
in how syntax properties are handled but it rarely matters.)
This check is only enabled when the 'checks' argument is 'all'.
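For single characters the two forms match the same text, as this small sketch shows (the sample string is arbitrary):

```elisp
(let ((text "xxb"))
  (list (string-match-p "a\\|b" text)   ; or-pattern
        (string-match-p "[ab]" text)))  ; character alternative
;; => (2 2)
```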
- Duplicated alternative branch
The same expression occurs in two different branches, like in
A\|A. This has the effect of only including it once.
- Branch matches superset/subset of a previous branch
A branch in an or-expression matches a superset or subset of what
another branch matches, like in [ab]\|a. This means that one of
the branches can be eliminated without changing the meaning of the
regexp.
- Repetition subsumes/subsumed by preceding repetition
A repeating expression matches a superset or subset of what the
previous expression matches, in such a way that one of them is
unnecessary. For example, [ab]+a* matches the same text as [ab]+,
so the a* could be removed without changing the meaning of the
regexp.
- First/last item in repetition subsumes last/first item (wrapped)
The first and last items in a repeated sequence, being effectively
adjacent, match a superset or subset of each other, which makes
for an unexpected inefficiency. For example, \(?:a*c[ab]+\)* can
be seen as a*c[ab]+a*c[ab]+... where the [ab]+a* in the middle is
a slow way of writing [ab]+ which is made worse by the outer
repetition. The general remedy is to move the subsumed item out of
the repeated sequence, resulting in a*\(?:c[ab]+\)* in the example
above.
- Non-newline follows end-of-line anchor
- Line-start anchor follows non-newline
A pattern that does not match a newline occurs right after an
end-of-line anchor ($) or before a line-start anchor (^).
This combination can never match.
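A minimal illustration of the first case. The $ is placed at the end of a group here because Emacs treats $ as an anchor only at the end of the regexp or before \) or \|:

```elisp
(list (string-match-p "\\(?:a$\\)b" "ab")       ; nil: $ fails before b
      (string-match-p "\\(?:a$\\)b" "a\nb")     ; nil: b cannot match the newline
      (string-match-p "\\(?:a$\\)\nb" "a\nb"))  ; 0: a newline after $ is fine
;; => (nil nil 0)
```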
- Non-empty pattern follows end-of-text anchor
A pattern that only matches a non-empty string occurs right after
an end-of-text anchor (\'). This combination can never match.
- Use \` instead of ^ in file-matching regexp
- Use \' instead of $ in file-matching regexp
In a regexp used for matching a file name, newlines are usually
not relevant. Line-start and line-end anchors should therefore
probably be replaced with string-start and string-end,
respectively. Otherwise, the regexp may fail for file names that
do contain newlines.
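The effect can be seen with a file name that contains a newline (the name is made up for this sketch):

```elisp
(let ((name "notes\nfoo.txt"))
  (list (string-match-p "^foo" name)      ; 6: ^ also matches after the newline
        (string-match-p "\\`foo" name)))  ; nil: \` matches only at string start
;; => (6 nil)
```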
- Possibly unescaped '.' in file-matching regexp
In a regexp used for matching a file name, a naked dot is usually
more likely to be a mistake (missing escaping backslash) than an
actual intent to match any character except newline, since literal
dots are very common in file name patterns.
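A short sketch of how a naked dot can make a file-matching regexp too permissive (sample names are arbitrary):

```elisp
(list (string-match-p ".el\\'" "model")     ; 2: the dot matches the d
      (string-match-p "\\.el\\'" "model")   ; nil: a literal dot is required
      (string-match-p "\\.el\\'" "xr.el"))  ; 2
;; => (2 nil 2)
```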
- Uncounted repetition
The construct A\{,\} repeats A zero or more times, which was
probably not intended.
- Implicit zero repetition
The construct A\{\} only matches the empty string, which was
probably not intended.
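The two constructs above can be compared by measuring how much text each one consumes. In this sketch, my-match-length is a hypothetical helper written for illustration and is not part of xr:

```elisp
(defun my-match-length (regexp string)
  "Return the length of the match of REGEXP at the start of STRING, or nil."
  (and (string-match (concat "\\`\\(?:" regexp "\\)") string)
       (match-end 0)))

(list (my-match-length "a\\{,\\}" "aaa")  ; 3: behaves like a*
      (my-match-length "a*" "aaa")        ; 3
      (my-match-length "a\\{\\}" "aaa"))  ; 0: only the empty string
;; => (3 3 0)
```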
- Suspect '[' in char alternative
This warning indicates badly-placed square brackets in a character
alternative, as in [A[B]C]. A literal ] must come first
(possibly after a negating ^).
- Literal '-' not first or last
It is good style to put a literal hyphen last in character
alternatives and skip sets, to clearly indicate that it was not
intended as part of a range.
- Repetition of zero-width assertion
- Optional zero-width assertion
A repetition operator was applied to a zero-width assertion, like
^ or \<, which is completely pointless. The error may be a missing
escaping backslash.
- Repetition of expression matching an empty string
- Optional expression matching an empty string
A repetition operator was applied to a sub-expression that could
match the empty string; this is not necessarily wrong, but such
constructs run very slowly on Emacs's regexp engine. Consider
rewriting them into a form where the repeated expression cannot
match the empty string.
Example: \(?:a*b*\)* is equivalent to the much faster \(?:a\|b\)*.
Another example: \(?:a?b*\)? is better written a?b*.
In general, A?, where A matches the empty string, can be
simplified to just A.
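The equivalences above can be spot-checked on sample strings. This is only a sketch: agreement on a few inputs illustrates the rewrites but does not prove them, and the inputs are too short to show any speed difference.

```elisp
(list (string-match-p "\\`\\(?:a*b*\\)*\\'" "abbaab")   ; original form
      (string-match-p "\\`\\(?:a\\|b\\)*\\'" "abbaab")  ; faster rewrite
      (string-match-p "\\`\\(?:a?b*\\)?\\'" "abb")      ; original form
      (string-match-p "\\`a?b*\\'" "abb"))              ; simplified
;; => (0 0 0 0)
```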
- Repetition of effective repetition
A repetition construct is applied to an expression that itself
contains a repetition, in addition to some patterns that may match
the empty string. This can lead to bad matching performance.
Example: \(?:a*b+\)* is equivalent to the much faster \(?:a\|b\)* .
Another example: \(?:a*b+\)+ is better written a*b[ab]* .
- Possibly mistyped ':?' at start of group
A group starts as \(:? which makes it likely that it was really
meant to be \(?: -- i.e., a non-capturing group.
This check is only enabled when the 'checks' argument is 'all'.
- Unnecessarily escaped 'X'
A character is backslash-escaped in a skip set despite not being
one of the three special characters - (hyphen), \ (backslash) and
^ (caret). It could be unnecessary, or a backslash that should
have been escaped.
- Single-element range 'X-X'
A range in a skip set has identical first and last elements. It is
rather pointless to have it as a range.
- Stray '\\' at end of string
A single backslash at the end of a skip set is always ignored;
double it if you want a literal backslash to be included.
- Suspect skip set framed in '[...]'
A skip set appears to be enclosed in [...], as if it were a
regexp. Skip sets are not regexps and do not use brackets. To
include the brackets themselves, put them next to each other.
- Suspect character class framed in '[...]'
A skip set contains a character class enclosed in double pairs of
square brackets, as if it were a regexp. Character classes in skip
sets are written inside a single pair of square brackets, like
[:digit:].
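The effect of framing a skip set or character class in brackets can be observed with skip-chars-forward itself. In this sketch, my-skip-count is a hypothetical helper written for illustration and is not part of xr:

```elisp
(defun my-skip-count (skip-set text)
  "Return how many characters of TEXT `skip-chars-forward' passes for SKIP-SET."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (skip-chars-forward skip-set)))

(list (my-skip-count "a-z" "abc]1")           ; 3: letters only
      (my-skip-count "[a-z]" "abc]1")         ; 4: the brackets are set members too
      (my-skip-count "[:digit:]" "12]ab")     ; 2: digits only
      (my-skip-count "[[:digit:]]" "12]ab"))  ; 3: digits plus [ and ]
;; => (3 4 2 3)
```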
- Empty set matches nothing
The empty string is a skip set that does not match anything, and
is therefore pointless.
- Negated empty set matches anything
The string "^" is a skip set that matches anything, and is therefore
pointless.
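Both degenerate skip sets can be tried in a temporary buffer. A minimal sketch using only built-in functions:

```elisp
(with-temp-buffer
  (insert "abc")
  (list (progn (goto-char (point-min))
               (skip-chars-forward ""))    ; 0: nothing is skipped
        (progn (goto-char (point-min))
               (skip-chars-forward "^")))) ; 3: everything is skipped
;; => (0 3)
```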
* See also
The relint package (https://elpa.gnu.org/packages/relint.html) uses xr
to find regexp mistakes in elisp code.
The lex package (https://elpa.gnu.org/packages/lex.html), a lexical
analyser generator, provides the lex-parse-re function which
translates regexps to rx, but does not attempt to handle all the
edge cases of Elisp's regexp syntax or pretty-print the result.
The pcre2el package (https://github.com/joddie/pcre2el), a regexp
syntax converter and interactive regexp explainer, can also be used
for translating regexps to rx. However, xr is more accurate for this
purpose.