@henryiii
Contributor

@henryiii henryiii commented Nov 25, 2025

This uses possessive quantifiers on 3.11+ to reduce backtracking. This makes the total time of my version-creation benchmark 10-17% faster (that's everything, not just the regex!).
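For readers unfamiliar with possessive quantifiers, here is a minimal sketch of how they differ from greedy ones. The patterns and the `matches` helper are illustrative only and are not taken from `packaging`; the version guard assumes the `re` module gained `++`, `*+` and `?+` in Python 3.11, as stated above.

```python
import re
import sys

# Possessive quantifiers (`++`, `*+`, `?+`) are only accepted by `re` on 3.11+.
HAS_POSSESSIVE = sys.version_info >= (3, 11)


def matches(pattern: str, text: str) -> bool:
    """Return True if `pattern` matches all of `text` (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None


# Greedy: `a+` first takes "aaa", then backtracks one character so that the
# trailing `a` in the pattern can still match.
greedy = matches(r"a+a", "aaa")

# Possessive: `a++` takes "aaa" and never gives anything back, so the
# trailing `a` has nothing left to match and the overall match fails.
possessive = matches(r"a++a", "aaa") if HAS_POSSESSIVE else None

print(greedy, possessive)
```

On CPython 3.11+ the greedy pattern matches and the possessive one does not. In a pattern where backtracking can never produce a match anyway, switching to the possessive form only removes the wasted retries, which is where the speedup comes from.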

(These plots were from an earlier version, doing slightly better now)

Before:

python-performance-flamegraph

After:

python-performance-flamegraph-2

@henryiii
Contributor Author

henryiii commented Nov 25, 2025

(CC @mattip) PyPy isn't working the same way as CPython with the new (in 3.11) atomics and possessives:

$ uv run --python=3.11 python
Using CPython 3.11.14
Removed virtual environment at: .venv
Creating virtual environment at: .venv
Installed 9 packages in 16ms
Python 3.11.14 (main, Nov 19 2025, 23:12:58) [Clang 21.1.4 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from packaging.version import VERSION_PATTERN
>>> import re
>>> _regex = re.compile(VERSION_PATTERN, re.VERBOSE | re.IGNORECASE)
>>> version = "1!3.0.0.rc2"
>>> _regex.fullmatch(version).groups()
('1', '3.0.0', '.rc2', 'rc', '2', None, None, None, None, None, None, None, None)

$ uv run --python=pypy3.11 python
Using PyPy 3.11.13
Removed virtual environment at: .venv
Creating virtual environment at: .venv
Installed 9 packages in 13ms
Python 3.11.13 (413c9b7f57f5, Jul 03 2025, 18:04:06)
[PyPy 7.3.20 with GCC Apple LLVM 16.0.0 (clang-1600.0.26.6)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>> from packaging.version import VERSION_PATTERN
>>>> import re
>>>> _regex = re.compile(VERSION_PATTERN, re.VERBOSE | re.IGNORECASE)
>>>> version = "1!3.0.0.rc2"
>>>> _regex.fullmatch(version).groups()
('1', '3.0.0.', 'rc2', None, None, None, None, None, None, None, None, None, None)

Small MWEs of this do seem to work. version = "3.0.0.rc2" has the same behavior as above. I can reproduce it with just possessives. Specifically, it is the possessive on the number inside the release segment. It seems that it also matches a dot (only in PyPy) if that possessive is present.
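A possible starting point for a smaller reproducer is sketched below. Both patterns are hypothetical cut-downs of the release and pre-release parts of the version pattern, written for this example; nothing in this thread confirms that this reduced form triggers the PyPy difference, so treat it as a template for comparing interpreters rather than a known failing case.

```python
import re
import sys

# Hypothetical cut-down patterns: release segment plus an "rc" pre-release.
# POSSESSIVE uses possessive quantifiers throughout; PLAIN is the same
# pattern with ordinary greedy quantifiers.
POSSESSIVE = (
    r"(?P<release>[0-9]++(?:\.[0-9]++)*+)"
    r"(?P<pre>[._-]?+(?P<pre_l>rc)[._-]?+(?P<pre_n>[0-9]++)?)?"
)
PLAIN = (
    r"(?P<release>[0-9]+(?:\.[0-9]+)*)"
    r"(?P<pre>[._-]?(?P<pre_l>rc)[._-]?(?P<pre_n>[0-9]+)?)?"
)


def groups(pattern: str, version: str):
    """Return the captured groups for a full match, or None if no match."""
    m = re.fullmatch(pattern, version, re.IGNORECASE)
    return m.groups() if m else None


version = "3.0.0.rc2"
plain = groups(PLAIN, version)
# Possessive syntax is rejected before Python 3.11, so skip it there.
possessive = groups(POSSESSIVE, version) if sys.version_info >= (3, 11) else None

print(sys.implementation.name, sys.version_info[:3])
print("plain:     ", plain)
print("possessive:", possessive)
```

Running the same file under CPython and PyPy and comparing the two printed tuples shows whether the interpreters agree. If the `release` group ends with a trailing dot in only one of them, the reduced pattern is enough for a bug report.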

@henryiii henryiii force-pushed the henryiii/perf/regex branch 3 times, most recently from 609b1e5 to 3bf4a59 Compare November 25, 2025 17:00
@mattip
Contributor

mattip commented Nov 25, 2025

PyPy isn't working the same way as CPython

What was the regex when they were different?

@henryiii
Contributor Author

henryiii commented Nov 25, 2025

The current regex still differs:

_VERSION_PATTERN = r"""
    v?+                                                   # optional leading v
    (?:
        (?:(?P<epoch>[0-9]+)!)?                           # epoch
        (?P<release>[0-9]++(?:\.[0-9]++)*+)               # release segment
        (?P<pre>                                          # pre-release
            [._-]?+
            (?P<pre_l>alpha|a|beta|b|preview|pre|c|rc)
            [._-]?+
            (?P<pre_n>[0-9]++)?
        )?
        (?P<post>                                         # post release
            (?:-(?P<post_n1>[0-9]+))
            |
            (?:
                [._-]?
                (?P<post_l>post|rev|r)
                [._-]?
                (?P<post_n2>[0-9]+)?
            )
        )?
        (?P<dev>                                          # dev release
            [._-]?+
            (?P<dev_l>dev)
            [._-]?+
            (?P<dev_n>[0-9]++)?
        )?
    )
    (?:\+                                                 # local version
        (?P<local>
            [a-z0-9]++
            (?:[._-][a-z0-9]++)*+
        )
    )?
"""

If you remove just one possessive, then it almost works, but still differs in one test:

diff --git a/src/packaging/version.py b/src/packaging/version.py
index 6a07c2b..3abff64 100644
--- a/src/packaging/version.py
+++ b/src/packaging/version.py
@@ -119,7 +119,7 @@ _VERSION_PATTERN = r"""
     v?+                                                   # optional leading v
     (?:
         (?:(?P<epoch>[0-9]+)!)?                           # epoch
-        (?P<release>[0-9]++(?:\.[0-9]++)*+)               # release segment
+        (?P<release>[0-9]++(?:\.[0-9]+)*+)                # release segment
         (?P<pre>                                          # pre-release
             [._-]?+
             (?P<pre_l>alpha|a|beta|b|preview|pre|c|rc)

    def test_invalid_versions(self, version: str) -> None:
>       with pytest.raises(InvalidVersion):
E       Failed: DID NOT RAISE <class 'packaging.version.InvalidVersion'>

self       = <tests.test_version.TestVersion object at 0x0000000122f70790>
version    = '1.0+_foobar'

@henryiii
Contributor Author

henryiii commented Nov 25, 2025

If I remove all instances of ++ it works the same as CPython. I think it fails to match CPython only when it's nested inside *+ (or maybe ?+).
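The with/without comparison described here can be automated with a small helper that rewrites possessive quantifiers as greedy ones. This is a naive sketch and not what the PR itself does: `strip_possessive` is a hypothetical name, and the substitution assumes the pattern contains no character class with `*+`, `++` or `?+` inside it, and no escaped backslash directly before a quantifier.

```python
import re


def strip_possessive(pattern: str) -> str:
    """Rewrite `++`, `*+` and `?+` as plain greedy `+`, `*` and `?`.

    Naive on purpose: a quantifier character preceded by a backslash is
    treated as a literal and left alone, but character classes and escaped
    backslashes are not analysed.
    """
    return re.sub(r"(?<!\\)([+*?])\+", r"\1", pattern)


release = r"[0-9]++(?:\.[0-9]++)*+"
print(strip_possessive(release))      # greedy form of the release segment
print(strip_possessive(r"v?+"))       # optional leading v
print(strip_possessive(r"(?:\+"))     # escaped literal plus is untouched
```

The same helper could also produce a fallback pattern for interpreters older than 3.11, where the possessive syntax does not compile.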

@henryiii henryiii force-pushed the henryiii/perf/regex branch from 811ea93 to 9137a14 Compare November 25, 2025 17:35
@henryiii
Copy link
Contributor Author

The nested possessives don't really affect performance, so the current version is quite fast. I've added the failing example to my perf test.

Before: 1.6288 s After: 1.4862 s (CPython 3.14 homebrew)
Before: 1.3448 s After: 1.2080 s (CPython 3.14 uv)
Before: 1.3120 s After: 1.1164 s (CPython 3.11 uv)
Before: 0.5370 s After: 0.4658 s (PyPy3.11 uv)
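For reference, the relative improvement implied by each pair of timings above can be computed as (before - after) / before. The snippet below only restates the numbers from this comment; the dictionary layout and the `relative_speedup` name are choices made for this sketch.

```python
# (before, after) wall-clock times in seconds, copied from the list above.
timings = {
    "CPython 3.14 homebrew": (1.6288, 1.4862),
    "CPython 3.14 uv": (1.3448, 1.2080),
    "CPython 3.11 uv": (1.3120, 1.1164),
    "PyPy3.11 uv": (0.5370, 0.4658),
}


def relative_speedup(before: float, after: float) -> float:
    """Fraction of the original runtime that was saved."""
    return (before - after) / before


for name, (before, after) in timings.items():
    print(f"{name}: {relative_speedup(before, after):.1%} faster")
```

The four results fall between roughly 9% and 15%.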

@mattip
Contributor

mattip commented Nov 25, 2025

If I remove all instances of ++ it works the same as CPython. I think it fails to match CPython only when it's nested inside *+ (or maybe ?+).

Thanks. It would be nice if you could reduce it to a smaller reproducer and open an issue at https://github.com/pypy/pypy/issues. Although it doesn't look like there will be another PyPy release at this point, so don't go to a lot of effort.

@henryiii henryiii force-pushed the henryiii/perf/regex branch from 9137a14 to b91cc16 Compare November 25, 2025 18:58
@henryiii henryiii marked this pull request as ready for review November 25, 2025 19:02
@henryiii henryiii force-pushed the henryiii/perf/regex branch from b91cc16 to ee37416 Compare November 25, 2025 19:05
@henryiii henryiii changed the title from "perf: faster regex" to "perf: faster regex on Python 3.11+" Nov 25, 2025
@henryiii
Contributor Author

henryiii commented Nov 25, 2025

Testing this on every version ever uploaded to PyPI, I get identical valid/invalid version counts and a 10% speedup on Python 3.13:

Before:

Loaded 8,168,377 valid versions
Time: 106.9584 seconds
Per version: 2.618840778 µs

After:

Loaded 8,168,377 valid versions
Time: 96.4139 seconds
Per version: 2.360663026 µs

(PyPy3.11 takes 25 seconds)

tasks/benchmark_versions.py:
# benchmark_versions.py
import sqlite3
import timeit
from packaging.version import Version, InvalidVersion

# Get data with:
# curl -L https://github.com/pypi-data/pypi-json-data/releases/download/latest/pypi-data.sqlite.gz | gzip -d > pypi-data.sqlite

def valid_version(v: str) -> bool:
    try:
        Version(v)
    except InvalidVersion:
        return False
    return True


with sqlite3.connect("pypi-data.sqlite") as conn:
    TEST_ALL_VERSIONS = [row[0] for row in conn.execute("SELECT version FROM projects") if valid_version(row[0])]

def bench():
    for v in TEST_ALL_VERSIONS:
        Version(v)


if __name__ == "__main__":
    print(f"Loaded {len(TEST_ALL_VERSIONS):,} valid versions")
    t = timeit.timeit("bench()", globals=globals(), number=5)
    print(f"Time: {t:.4f} seconds")
    print(f"Per version: {1_000_000 * t / len(TEST_ALL_VERSIONS) / 5:.9f} µs")

Signed-off-by: Henry Schreiner <henryfs@princeton.edu>
@henryiii henryiii force-pushed the henryiii/perf/regex branch from ee37416 to 1ccede1 Compare November 26, 2025 03:16
@henryiii henryiii merged commit f67a11e into pypa:main Nov 26, 2025
40 checks passed
@henryiii henryiii deleted the henryiii/perf/regex branch November 26, 2025 18:27

3 participants