Commit

Fix bugs; ensure memory is released; simplify C++ interfacing;
- Fix bug causing zero-length matches to be returned multiple times
- Use Latin 1 encoding with RE2 when unicode not requested
- Ensure memory is released:
  - put del calls in finally blocks
  - add missing del call for 'matches' array
- Remove Cython hacks for C++ that are no longer needed;
  use const keyword that has been supported for some time.
  Fixes Cython 0.24 compilation issue.
- Turn _re2.pxd into includes.pxi.
- remove some tests that are specific to internal Python modules _sre and sre
andreasvc committed Apr 26, 2016
1 parent 415fd39 commit 224abc5
Showing 16 changed files with 275 additions and 320 deletions.
24 changes: 14 additions & 10 deletions Makefile
@@ -1,22 +1,14 @@
all:
python setup.py build_ext --cython

install:
python setup.py install --user --cython

test: all
cp build/lib*-2.*/re2.so tests/
test: install
(cd tests && python re2_test.py)
(cd tests && python test_re.py)

py3:
python3 setup.py build_ext --cython

install3:
python3 setup.py install --user --cython

test3: py3
cp build/lib*-3.*/re2*.so tests/re2.so
test3: install3
(cd tests && python3 re2_test.py)
(cd tests && python3 test_re.py)

@@ -25,3 +17,15 @@ clean:
rm -rf src/*.so src/*.html &>/dev/null
rm -rf re2.so tests/re2.so &>/dev/null
rm -rf src/re2.cpp &>/dev/null

valgrind:
python3.5-dbg setup.py install --user --cython && \
(cd tests && valgrind --tool=memcheck --suppressions=../valgrind-python.supp \
--leak-check=full --show-leak-kinds=definite \
python3.5-dbg test_re.py)

valgrind2:
python3.5-dbg setup.py install --user --cython && \
(cd tests && valgrind --tool=memcheck --suppressions=../valgrind-python.supp \
--leak-check=full --show-leak-kinds=definite \
python3.5-dbg re2_test.py)
75 changes: 40 additions & 35 deletions README.rst
@@ -47,52 +47,59 @@ And in the above example, ``set_fallback_notification`` can handle 3 values:
``re.FALLBACK_QUIETLY`` (default), ``re.FALLBACK_WARNING`` (raises a warning), and
``re.FALLBACK_EXCEPTION`` (which raises an exception).

**Note**: The re2 module treats byte strings as UTF-8. This is fully backwards compatible with 7-bit ascii.
However, bytes containing values larger than 0x7f are going to be treated very differently in re2 than in re.
The RE library quietly ignores invalid utf8 in input strings, and throws an exception on invalid utf8 in patterns.
For example:

>>> re.findall(r'.', '\x80\x81\x82')
['\x80', '\x81', '\x82']
>>> re2.findall(r'.', '\x80\x81\x82')
[]

If you require the use of regular expressions over an arbitrary stream of bytes, then this library might not be for you.

Installation
============

To install, you must first install the prerequisites:

* The `re2 library from Google <http://code.google.com/p/re2/>`_
* The Python development headers (e.g. *sudo apt-get install python-dev*)
* A build environment with ``g++`` (e.g. *sudo apt-get install build-essential*)
* The Python development headers (e.g. ``sudo apt-get install python-dev``)
* A build environment with ``g++`` (e.g. ``sudo apt-get install build-essential``)
* Cython 0.20+ (``pip install cython``)

After the prerequisites are installed, you can install as follows::

$ git clone git://github.com/andreasvc/pyre2.git
$ cd pyre2
$ make install

After the prerequisites are installed, you can try installing using ``easy_install``::
(or ``make install3`` for Python 3)

$ sudo easy_install re2
Unicode Support
===============

if you have setuptools installed (or use ``pip``).
Python ``bytes`` and ``unicode`` strings are fully supported, but note that
``RE2`` works with UTF-8 encoded strings under the hood, which means that
``unicode`` strings need to be encoded and decoded back and forth.
There are two important factors:

If you don't want to use ``setuptools``, you can alternatively download the tarball from `pypi <http://pypi.python.org/pypi/re2/>`_.
* whether a ``unicode`` pattern and search string is used (will be encoded to UTF-8 internally)
* the ``UNICODE`` flag: whether operators such as ``\w`` recognize Unicode characters.

Alternative to those, you can clone this repository and try installing it from there. To do this, run::
To avoid the overhead of encoding and decoding to UTF-8, it is possible to pass
UTF-8 encoded bytes strings directly but still treat them as ``unicode``::

$ git clone git://github.com/axiak/pyre2.git
$ cd pyre2.git
$ sudo python setup.py install
In [18]: re2.findall(u'\w'.encode('utf8'), u'Mötley Crüe'.encode('utf8'), flags=re2.UNICODE)
Out[18]: ['M', '\xc3\xb6', 't', 'l', 'e', 'y', 'C', 'r', '\xc3\xbc', 'e']
In [19]: re2.findall(u'\w'.encode('utf8'), u'Mötley Crüe'.encode('utf8'))
Out[19]: ['M', 't', 'l', 'e', 'y', 'C', 'r', 'e']

If you want to make changes to the bindings, you must have Cython >=0.13.
However, note that the indices in ``Match`` objects will refer to the bytes string.
The indices of the match in the ``unicode`` string could be computed by
decoding/encoding, but this is done automatically and more efficiently if you
pass the ``unicode`` string::

Unicode Support
===============
>>> re2.search(u'ü'.encode('utf8'), u'Mötley Crüe'.encode('utf8'), flags=re2.UNICODE)
<re2.Match object; span=(10, 12), match='\xc3\xbc'>
>>> re2.search(u'ü', u'Mötley Crüe', flags=re2.UNICODE)
<re2.Match object; span=(9, 10), match=u'\xfc'>
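
The relation between the two spans above can be checked without ``re2``. The following is a minimal sketch in plain Python; the helper name ``byte_span_to_char_span`` is hypothetical and not part of pyre2, and it only illustrates the decode-the-prefix approach that passing the ``unicode`` string makes unnecessary:

```python
# -*- coding: utf-8 -*-

def byte_span_to_char_span(data, span, encoding='utf8'):
    """Map a (start, end) span in encoded bytes to character indices.

    Hypothetical helper for illustration; assumes both offsets fall on
    character boundaries of the encoded string.
    """
    start, end = span
    # The number of characters before a byte offset is the length of the
    # decoded prefix up to that offset.
    return (len(data[:start].decode(encoding)),
            len(data[:end].decode(encoding)))


text = u'Mötley Crüe'
data = text.encode('utf8')

# Byte span (10, 12) is the two-byte UTF-8 sequence for 'ü'.
char_span = byte_span_to_char_span(data, (10, 12))
print(char_span)  # (9, 10)
```

Decoding a prefix for every match is linear in the offset, which is one reason to prefer passing the ``unicode`` string and letting the module do the conversion.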

Finally, if you want to match bytes without regard for Unicode characters,
pass bytes strings and leave out the ``UNICODE`` flag (this will cause Latin 1
encoding to be used with ``RE2`` under the hood)::

One current issue is Unicode support. As you may know, ``RE2`` supports UTF8,
which is certainly distinct from unicode. Right now the module will automatically
encode any unicode string into utf8 for you, which is *slow* (it also has to
decode utf8 strings back into unicode objects on every substitution or split).
Therefore, you are better off using bytestrings in utf8 while working with RE2
and encoding things after everything you need done is finished.
>>> re2.findall(br'.', b'\x80\x81\x82')
['\x80', '\x81', '\x82']
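
Why Latin 1 is a workable choice for arbitrary bytes can be seen without ``re2``: Latin 1 assigns a code point to each of the 256 byte values, whereas the same bytes are not valid UTF-8. A small sketch in plain Python (variable names are illustrative only):

```python
data = b'\x80\x81\x82'

# Latin 1 maps every byte value to exactly one code point, so decoding
# cannot fail and the round trip is lossless.
as_latin1 = data.decode('latin-1')
round_trip = as_latin1.encode('latin-1')

# The same bytes are not valid UTF-8: 0x80 cannot start a sequence.
try:
    data.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(len(as_latin1), round_trip == data, valid_utf8)
```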

Performance
===========
@@ -104,7 +111,7 @@ I've found that occasionally python's regular ``re`` module is actually slightly
However, when the ``re`` module gets slow, it gets *really* slow, while this module
buzzes along.

In the below example, I'm running the data against 8MB of text from the collosal Wikipedia
In the below example, I'm running the data against 8MB of text from the colossal Wikipedia
XML file. I'm running them multiple times, being careful to use the ``timeit`` module.
To see more details, please see the `performance script <http://github.com/axiak/pyre2/tree/master/tests/performance.py>`_.

@@ -131,8 +138,6 @@ The tests show the following differences with Python's ``re`` module:
* ``pyre2`` and Python's ``re`` behave differently with nested and empty groups;
``pyre2`` will return an empty string in cases where Python would return None
for a group that did not participate in a match.
* Any bytestrings with invalid UTF-8 or other non-ASCII data may behave
differently.

Please report any further issues with ``pyre2``.

@@ -162,5 +167,5 @@ and Facebook for the initial inspiration. Plus, I got to
gut this readme file!

Moreover, this library would of course not be possible if not for
the immense work of the team at RE2 and the few people who work
the immense work of the team at ``RE2`` and the few people who work
on Cython.
7 changes: 5 additions & 2 deletions setup.py
@@ -29,8 +29,11 @@ def run(self):

def version_compare(version1, version2):
def normalize(v):
return [int(x) for x in re.sub(r'(\.0+)*$','', v).split(".")]
return cmp(normalize(version1), normalize(version2))
return [int(x) for x in re.sub(r'(\.0+)*$', '', v).split(".")]
try:
return cmp(normalize(version1), normalize(version2))
except ValueError: # raised by e.g. '0.24b0'
return 1

cmdclass = {'test': TestCommand}
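
Note that ``cmp`` is a Python 2 builtin and does not exist in Python 3. The following is a sketch of the same comparison that runs on both, not the code in this commit; it keeps the commit's behaviour of returning 1 when a version such as ``'0.24b0'`` cannot be parsed:

```python
import re


def normalize(version):
    """Drop trailing '.0' components and split into a list of integers."""
    return [int(x) for x in re.sub(r'(\.0+)*$', '', version).split('.')]


def version_compare(version1, version2):
    """Return -1, 0 or 1, in the manner of Python 2's cmp()."""
    try:
        a, b = normalize(version1), normalize(version2)
    except ValueError:  # raised by e.g. '0.24b0'
        return 1
    # (a > b) - (a < b) is the usual cmp() replacement.
    return (a > b) - (a < b)


print(version_compare('0.24', '0.20'))    # 1
print(version_compare('0.20.0', '0.20'))  # 0
print(version_compare('0.24b0', '0.20'))  # 1
```

Returning 1 on a parse failure treats any unparsable version as newer than the one it is compared with, which is a coarse assumption carried over from the diff above.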

11 changes: 0 additions & 11 deletions src/_re2macros.h
@@ -9,16 +9,5 @@ static inline re2::StringPiece * new_StringPiece_array(int n)
re2::StringPiece * sp = new re2::StringPiece[n];
return sp;
}
static inline void delete_StringPiece_array(re2::StringPiece* ptr)
{
delete[] ptr;
}

#define addressof(A) (&A)
#define addressofs(A) (&A)

#define as_char(A) (char *)(A)
#define pattern_Replace(A, B, C) re2::RE2::Replace((A), (B), (C))
#define pattern_GlobalReplace(A, B, C) re2::RE2::GlobalReplace((A), (B), (C))

#endif
48 changes: 22 additions & 26 deletions src/compile.pxi
@@ -15,7 +15,6 @@ def _compile(object pattern, int flags=0, int max_mem=8388608):
"""Compile a regular expression pattern, returning a pattern object."""
def fallback(pattern, flags, error_msg):
"""Raise error, warn, or simply return fallback from re module."""
error_msg = "re.LOCALE not supported"
if current_notification == FALLBACK_EXCEPTION:
raise RegexError(error_msg)
elif current_notification == FALLBACK_WARNING:
@@ -26,8 +25,8 @@ def _compile(object pattern, int flags=0, int max_mem=8388608):
raise RegexError(*err.args)
return result

cdef _re2.StringPiece * s
cdef _re2.Options opts
cdef StringPiece * s
cdef Options opts
cdef int error_code
cdef int encoded = 0
cdef object original_pattern
@@ -44,13 +43,13 @@ def _compile(object pattern, int flags=0, int max_mem=8388608):
pattern = unicode_to_bytes(pattern, &encoded, -1)
newflags = flags
if not PY2:
if not encoded and flags & _U:
pass
if not encoded and flags & _U: # re.UNICODE
pass # can use UNICODE with bytes pattern, but assumes valid UTF-8
# raise ValueError("can't use UNICODE flag with a bytes pattern")
elif encoded and not (flags & re.ASCII):
newflags = flags | re.UNICODE
newflags = flags | _U # re.UNICODE
elif encoded and flags & re.ASCII:
newflags = flags & ~re.UNICODE
newflags = flags & ~_U # re.UNICODE
try:
pattern = _prepare_pattern(pattern, newflags)
except BackreferencesException:
@@ -59,22 +58,23 @@ def _compile(object pattern, int flags=0, int max_mem=8388608):
return fallback(original_pattern, flags,
"\W and \S not supported inside character classes")


# Set the options given the flags above.
if flags & _I:
opts.set_case_sensitive(0);

opts.set_max_mem(max_mem)
opts.set_log_errors(0)
opts.set_encoding(_re2.EncodingUTF8)
if flags & _U or encoded:
opts.set_encoding(EncodingUTF8)
else: # re.UNICODE flag not passed, and pattern is bytes,
# so allow matching of arbitrary byte sequences.
opts.set_encoding(EncodingLatin1)

s = new _re2.StringPiece(<char *><bytes>pattern, len(pattern))
s = new StringPiece(<char *><bytes>pattern, len(pattern))

cdef _re2.RE2 *re_pattern
cdef _re2.const_stringintmap * named_groups
cdef _re2.stringintmapiterator it
cdef RE2 *re_pattern
with nogil:
re_pattern = new _re2.RE2(s[0], opts)
re_pattern = new RE2(s[0], opts)

if not re_pattern.ok():
# Something went wrong with the compilation.
@@ -85,9 +85,9 @@ def _compile(object pattern, int flags=0, int max_mem=8388608):
if current_notification == FALLBACK_EXCEPTION:
# Raise an exception regardless of the type of error.
raise RegexError(error_msg)
elif error_code not in (_re2.ErrorBadPerlOp, _re2.ErrorRepeatSize,
# _re2.ErrorBadEscape,
_re2.ErrorPatternTooLarge):
elif error_code not in (ErrorBadPerlOp, ErrorRepeatSize,
# ErrorBadEscape,
ErrorPatternTooLarge):
# Raise an error because these will not be fixed by using the
# ``re`` module.
raise RegexError(error_msg)
@@ -96,24 +96,20 @@ def _compile(object pattern, int flags=0, int max_mem=8388608):
return re.compile(original_pattern, flags)

cdef Pattern pypattern = Pattern()
cdef map[cpp_string, int] named_groups = re_pattern.NamedCapturingGroups()
pypattern.pattern = original_pattern
pypattern.re_pattern = re_pattern
pypattern.groups = re_pattern.NumberOfCapturingGroups()
pypattern.encoded = encoded
pypattern.flags = flags
pypattern.groupindex = {}
named_groups = _re2.addressof(re_pattern.NamedCapturingGroups())
it = named_groups.begin()
while it != named_groups.end():
for it in named_groups:
if encoded:
pypattern.groupindex[cpp_to_unicode(deref(it).first)
] = deref(it).second
pypattern.groupindex[cpp_to_unicode(it.first)] = it.second
else:
pypattern.groupindex[cpp_to_bytes(deref(it).first)
] = deref(it).second
inc(it)
pypattern.groupindex[cpp_to_bytes(it.first)] = it.second

if flags & re.DEBUG:
if flags & DEBUG:
print(repr(pypattern._dump_pattern()))
del s
return pypattern
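
The flag handling and encoding choice in ``_compile`` can be summarised in plain Python. This is a sketch under assumptions, not the Cython code: the function names ``effective_flags`` and ``choose_encoding`` are hypothetical, the strings ``'utf8'`` and ``'latin1'`` stand in for ``EncodingUTF8`` and ``EncodingLatin1``, and ``encoded`` is taken to mean that the pattern was a ``unicode`` string that had to be encoded to bytes:

```python
import re


def effective_flags(flags, encoded):
    """Mirror the Python 3 branch that derives ``newflags``.

    A unicode pattern implies UNICODE unless ASCII is requested; a bytes
    pattern keeps its flags unchanged.
    """
    if encoded and not (flags & re.ASCII):
        return flags | re.UNICODE
    if encoded and (flags & re.ASCII):
        return flags & ~re.UNICODE
    return flags


def choose_encoding(flags, encoded):
    """Mirror the set_encoding() branch, which tests the original flags."""
    if (flags & re.UNICODE) or encoded:
        return 'utf8'    # placeholder for EncodingUTF8
    # Bytes pattern without UNICODE: match arbitrary byte sequences.
    return 'latin1'      # placeholder for EncodingLatin1


print(choose_encoding(0, encoded=False))           # latin1
print(choose_encoding(re.UNICODE, encoded=False))  # utf8
print(choose_encoding(0, encoded=True))            # utf8
```

Latin 1 is therefore selected only for a bytes pattern compiled without the ``UNICODE`` flag, which matches the README note about matching bytes without regard for Unicode characters.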