perf: faster regex on Python 3.11+ #988
(CC @mattip) PyPy doesn't behave the same way as CPython with the atomic groups and possessive quantifiers that are new in 3.11:
$ uv run --python=3.11 python
Using CPython 3.11.14
Removed virtual environment at: .venv
Creating virtual environment at: .venv
Installed 9 packages in 16ms
Python 3.11.14 (main, Nov 19 2025, 23:12:58) [Clang 21.1.4 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from packaging.version import VERSION_PATTERN
>>> import re
>>> _regex = re.compile(VERSION_PATTERN, re.VERBOSE | re.IGNORECASE)
>>> version = "1!3.0.0.rc2"
>>> _regex.fullmatch(version).groups()
('1', '3.0.0', '.rc2', 'rc', '2', None, None, None, None, None, None, None, None)
$ uv run --python=pypy3.11 python
Using PyPy 3.11.13
Removed virtual environment at: .venv
Creating virtual environment at: .venv
Installed 9 packages in 13ms
Python 3.11.13 (413c9b7f57f5, Jul 03 2025, 18:04:06)
[PyPy 7.3.20 with GCC Apple LLVM 16.0.0 (clang-1600.0.26.6)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>> from packaging.version import VERSION_PATTERN
>>>> import re
>>>> _regex = re.compile(VERSION_PATTERN, re.VERBOSE | re.IGNORECASE)
>>>> version = "1!3.0.0.rc2"
>>>> _regex.fullmatch(version).groups()
('1', '3.0.0.', 'rc2', None, None, None, None, None, None, None, None, None, None)
Small MWEs of this do seem to work.
What was the regex when they were different?
The current regex still differs:
_VERSION_PATTERN = r"""
v?+ # optional leading v
(?:
(?:(?P<epoch>[0-9]+)!)? # epoch
(?P<release>[0-9]++(?:\.[0-9]++)*+) # release segment
(?P<pre> # pre-release
[._-]?+
(?P<pre_l>alpha|a|beta|b|preview|pre|c|rc)
[._-]?+
(?P<pre_n>[0-9]++)?
)?
(?P<post> # post release
(?:-(?P<post_n1>[0-9]+))
|
(?:
[._-]?
(?P<post_l>post|rev|r)
[._-]?
(?P<post_n2>[0-9]+)?
)
)?
(?P<dev> # dev release
[._-]?+
(?P<dev_l>dev)
[._-]?+
(?P<dev_n>[0-9]++)?
)?
)
(?:\+ # local version
(?P<local>
[a-z0-9]++
(?:[._-][a-z0-9]++)*+
)
)?
"""
If you remove just one possessive, then it almost works, but still differs in one test:
diff --git a/src/packaging/version.py b/src/packaging/version.py
index 6a07c2b..3abff64 100644
--- a/src/packaging/version.py
+++ b/src/packaging/version.py
@@ -119,7 +119,7 @@ _VERSION_PATTERN = r"""
v?+ # optional leading v
(?:
(?:(?P<epoch>[0-9]+)!)? # epoch
- (?P<release>[0-9]++(?:\.[0-9]++)*+) # release segment
+ (?P<release>[0-9]++(?:\.[0-9]+)*+) # release segment
(?P<pre> # pre-release
[._-]?+
(?P<pre_l>alpha|a|beta|b|preview|pre|c|rc)
If I remove all instances of |
The nested possessives don't really affect performance, so the current version is quite fast. I've added the failing example to my perf test.
Thanks. It would be nice if you could reduce it to a smaller reproducer and open an issue at https://github.com/pypy/pypy/issues. Although it doesn't look like there will be another PyPy release at this point, so don't go to a lot of effort.
Testing this on every version ever uploaded to PyPI, I get identical valid/invalid version counts and a 10% speedup on Python 3.13:
Before:
After:
(PyPy3.11 takes 25 seconds)
tasks/benchmark_versions.py:
# benchmark_versions.py
import sqlite3
import timeit

from packaging.version import Version, InvalidVersion

# Get data with:
# curl -L https://github.com/pypi-data/pypi-json-data/releases/download/latest/pypi-data.sqlite.gz | gzip -d > pypi-data.sqlite


def valid_version(v: str) -> bool:
    try:
        Version(v)
    except InvalidVersion:
        return False
    return True


# Keep only the versions that parse, so the benchmark measures successful construction.
with sqlite3.connect("pypi-data.sqlite") as conn:
    TEST_ALL_VERSIONS = [row[0] for row in conn.execute("SELECT version FROM projects") if valid_version(row[0])]


def bench():
    for v in TEST_ALL_VERSIONS:
        Version(v)


if __name__ == "__main__":
    print(f"Loaded {len(TEST_ALL_VERSIONS):,} valid versions")
    t = timeit.timeit("bench()", globals=globals(), number=5)
    print(f"Time: {t:.4f} seconds")
    print(f"Per version: {1_000_000 * t / len(TEST_ALL_VERSIONS) / 5:.9f} µs")
Signed-off-by: Henry Schreiner <henryfs@princeton.edu>
This uses possessive quantifiers on 3.11+ to reduce backtracking. This makes the total time of my version-creation benchmark 10-17% faster (that's everything, not just the regex!).
(These plots are from an earlier version; the current one does slightly better.)
[Figure: benchmark plot, before]
[Figure: benchmark plot, after]