rejit is a simple regular expression just-in-time compiler for Python. The
project is intended for educational and experimental purposes.
rejit supports regular expressions, which describe regular languages, as in
formal language theory.
This means that rejit lacks advanced regexp features like backreferences,
because of too low expressive power. The upside of this is that rejit can use
deterministic finite automata,
which can accept input in linear time.
Rough plan to JIT compile regular expressions:
- parse regexp string to an AST
- transform the AST (optimization, simplification)
- construct a nondeterministic finite automaton from the AST
- convert the NFA to a DFA
- compile the DFA to some intermediate representation
- compile the IR to native code
- pack native code in an easy to call wrapper
rejit supports the following regexp features:
- union -
abc|def - concatenation -
abcd - Kleene star -
a* - Kleene plus -
b+ - question mark operator -
c? - grouping -
(a|b)c - any character -
. - character set (including character ranges) -
[a-zXYZ] - escaped special characters -
\.
Currently rejit can only decide whether a string exactly matches a regexp.
There is no search support.
rejit provides three types of matchers:
- NFA-based matcher - default, created implicitly when creating a
Regexobject - DFA-based matcher - a linear time matcher, created with
compile_to_DFA() - JIT compiled matcher - a linear time matcher, compiled to x86 machine code.
Created with
compile_to_x86()
Regular expressions in rejit can be used to check if a string looks like a
number. Here's a pretty bad attempt:
>>> import rejit.regex as re
>>> regex = re.Regex(r'\-?[0-9]*(\.[0-9]+)?')
>>> regex.accept('not a number')
False
>>> regex.accept('42')
True
>>> regex.accept('-1.0')
True
>>> regex.accept('.999')
True
>>> regex.accept('-.')
False
>>> regex.accept('-')
True
# ugh, a lone minus is accepted, but that's actually the regex' fault, not a bug
To speed up, we can convert regex matcher to a DFA from a default NFA.
>>> regex.compile_to_DFA()
# `regex` will now use a DFA-based matcher, which should work faster
>>> regex.accept('-1000.00')
True
For even larger speedup, we can JIT compile regex matcher to x86 machine code.
>>> regex.compile_to_x86()
# `regex` will now use a JIT compiled matcher, which should work even faster than a DFA one
>>> regex.accept('999.999')
True
>>> regex.accept('0xFF')
False
rejit package is distributed by source. Clone the repository:
git clone git@github.com:ziowk/rejit.git
cd rejit
Install the package with setuptools:
python setup.py install
Or install the development version with pip:
pip install -e .[dev]
And run tests with py.test (installed with the development version):
py.test
Supports Python 3 only.
JIT compilation is available on 32-bit and 64-bit Python on Windows and Linux.
No external dependencies.
rejit's classes are documented in docstrings. Automatic documentation
generation isn't set up yet, so no cute html docs to browse. Related issue
#26
rejit is licensed under the terms of the GPLv2 license.
For more information see LICENSE