Fix bugs; ensure memory is released; simplify C++ interfacing;

- Fix bug causing zero-length matches to be returned multiple times
- Use Latin 1 encoding with RE2 when unicode not requested
- Ensure memory is released:
  - put del calls in finally blocks
  - add missing del call for 'matches' array
- Remove Cython hacks for C++ that are no longer needed;
  use const keyword that has been supported for some time.
  Fixes Cython 0.24 compilation issue.
- Turn _re2.pxd into includes.pxi.
- Remove some tests that are specific to internal Python modules _sre and sre
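
As a rough illustration of why the Latin 1 change allows matching arbitrary bytes, the sketch below uses only plain Python 3 and does not call re2 at all. The helper name `is_valid_utf8` is hypothetical and exists only for this example. It shows that every byte string round-trips through Latin 1 unchanged, while a sequence such as `\x80\x81\x82` is not valid UTF-8.

```python
# Every byte value 0-255 maps to exactly one Latin 1 code point, so any
# byte string survives a decode/encode round trip unchanged.
all_bytes = bytes(range(256))
assert all_bytes.decode("latin-1").encode("latin-1") == all_bytes


def is_valid_utf8(raw: bytes) -> bool:
    """Hypothetical helper: report whether raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


sample = b"\x80\x81\x82"
print(is_valid_utf8(sample))           # False: 0x80 cannot start a UTF-8 sequence
print(len(sample.decode("latin-1")))   # 3: one character per byte
```

Under this reading, a bytes pattern compiled without the UNICODE flag can treat each byte as one character, which is presumably what the Latin 1 setting is meant to achieve.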
axiak · Apr 26, 2016 · 224abc5
1 parent 415fd39
commit 224abc5

Showing 16 changed files with 275 additions and 320 deletions.
diff --git a/Makefile b/Makefile
@@ -1,22 +1,14 @@
-all:
-	python setup.py build_ext --cython
-
 install:
 	python setup.py install --user --cython
 
-test: all
-	cp build/lib*-2.*/re2.so tests/
+test: install
 	(cd tests && python re2_test.py)
 	(cd tests && python test_re.py)
 
-py3:
-	python3 setup.py build_ext --cython
-
 install3:
 	python3 setup.py install --user --cython
 
-test3: py3
-	cp build/lib*-3.*/re2*.so tests/re2.so
+test3: install3
 	(cd tests && python3 re2_test.py)
 	(cd tests && python3 test_re.py)
 
@@ -25,3 +17,15 @@ clean:
 	rm -rf src/*.so src/*.html &>/dev/null
 	rm -rf re2.so tests/re2.so &>/dev/null
 	rm -rf src/re2.cpp &>/dev/null
+
+valgrind:
+	python3.5-dbg setup.py install --user --cython && \
+	(cd tests && valgrind --tool=memcheck --suppressions=../valgrind-python.supp \
+	--leak-check=full --show-leak-kinds=definite \
+	python3.5-dbg test_re.py)
+
+valgrind2:
+	python3.5-dbg setup.py install --user --cython && \
+	(cd tests && valgrind --tool=memcheck --suppressions=../valgrind-python.supp \
+	--leak-check=full --show-leak-kinds=definite \
+	python3.5-dbg re2_test.py)
diff --git a/README.rst b/README.rst
@@ -47,52 +47,59 @@ And in the above example, ``set_fallback_notification`` can handle 3 values:
 ``re.FALLBACK_QUIETLY`` (default), ``re.FALLBACK_WARNING`` (raises a warning), and
 ``re.FALLBACK_EXCEPTION`` (which raises an exception).
 
-**Note**: The re2 module treats byte strings as UTF-8. This is fully backwards compatible with 7-bit ascii.
-However, bytes containing values larger than 0x7f are going to be treated very differently in re2 than in re.
-The RE library quietly ignores invalid utf8 in input strings, and throws an exception on invalid utf8 in patterns.
-For example:
-
-    >>> re.findall(r'.', '\x80\x81\x82')
-    ['\x80', '\x81', '\x82']
-    >>> re2.findall(r'.', '\x80\x81\x82')
-    []
-
-If you require the use of regular expressions over an arbitrary stream of bytes, then this library might not be for you.
-
 Installation
 ============
 
 To install, you must first install the prerequisites:
 
 * The `re2 library from Google <http://code.google.com/p/re2/>`_
-* The Python development headers (e.g. *sudo apt-get install python-dev*)
-* A build environment with ``g++`` (e.g. *sudo apt-get install build-essential*)
+* The Python development headers (e.g. ``sudo apt-get install python-dev``)
+* A build environment with ``g++`` (e.g. ``sudo apt-get install build-essential``)
+* Cython 0.20+ (``pip install cython``)
+
+After the prerequisites are installed, you can install as follows::
+
+    $ git clone git://github.com/andreasvc/pyre2.git
+    $ cd pyre2
+    $ make install
 
-After the prerequisites are installed, you can try installing using ``easy_install``::
+(or ``make install3`` for Python 3)
 
-    $ sudo easy_install re2
+Unicode Support
+===============
 
-if you have setuptools installed (or use ``pip``).
+Python ``bytes`` and ``unicode`` strings are fully supported, but note that
+``RE2`` works with UTF-8 encoded strings under the hood, which means that
+``unicode`` strings need to be encoded and decoded back and forth.
+There are two important factors:
 
-If you don't want to use ``setuptools``, you can alternatively download the tarball from `pypi <http://pypi.python.org/pypi/re2/>`_.
+* whether a ``unicode`` pattern and search string is used (will be encoded to UTF-8 internally)
+* the ``UNICODE`` flag: whether operators such as ``\w`` recognize Unicode characters.
 
-Alternative to those, you can clone this repository and try installing it from there. To do this, run::
+To avoid the overhead of encoding and decoding to UTF-8, it is possible to pass
+UTF-8 encoded bytes strings directly but still treat them as ``unicode``::
 
-    $ git clone git://github.com/axiak/pyre2.git
-    $ cd pyre2.git
-    $ sudo python setup.py install
+    In [18]: re2.findall(u'\w'.encode('utf8'), u'Mötley Crüe'.encode('utf8'), flags=re2.UNICODE)
+    Out[18]: ['M', '\xc3\xb6', 't', 'l', 'e', 'y', 'C', 'r', '\xc3\xbc', 'e']
+    In [19]: re2.findall(u'\w'.encode('utf8'), u'Mötley Crüe'.encode('utf8'))
+    Out[19]: ['M', 't', 'l', 'e', 'y', 'C', 'r', 'e']
 
-If you want to make changes to the bindings, you must have Cython >=0.13.
+However, note that the indices in ``Match`` objects will refer to the bytes string.
+The indices of the match in the ``unicode`` string could be computed by
+decoding/encoding, but this is done automatically and more efficiently if you
+pass the ``unicode`` string::
 
-Unicode Support
-===============
+    >>> re2.search(u'ü'.encode('utf8'), u'Mötley Crüe'.encode('utf8'), flags=re2.UNICODE)
+    <re2.Match object; span=(10, 12), match='\xc3\xbc'>
+    >>> re2.search(u'ü', u'Mötley Crüe', flags=re2.UNICODE)
+    <re2.Match object; span=(9, 10), match=u'\xfc'>
+
+Finally, if you want to match bytes without regard for Unicode characters,
+pass bytes strings and leave out the ``UNICODE`` flag (this will cause Latin 1
+encoding to be used with ``RE2`` under the hood)::
 
-One current issue is Unicode support. As you may know, ``RE2`` supports UTF8,
-which is certainly distinct from unicode. Right now the module will automatically
-encode any unicode string into utf8 for you, which is *slow* (it also has to
-decode utf8 strings back into unicode objects on every substitution or split).
-Therefore, you are better off using bytestrings in utf8 while working with RE2
-and encoding things after everything you need done is finished.
+    >>> re2.findall(br'.', b'\x80\x81\x82')
+    ['\x80', '\x81', '\x82']
 
 Performance
 ===========
@@ -104,7 +111,7 @@ I've found that occasionally python's regular ``re`` module is actually slightly
 However, when the ``re`` module gets slow, it gets *really* slow, while this module
 buzzes along.
 
-In the below example, I'm running the data against 8MB of text from the collosal Wikipedia
+In the below example, I'm running the data against 8MB of text from the colossal Wikipedia
 XML file. I'm running them multiple times, being careful to use the ``timeit`` module.
 To see more details, please see the `performance script <http://github.com/axiak/pyre2/tree/master/tests/performance.py>`_.
 
@@ -131,8 +138,6 @@ The tests show the following differences with Python's ``re`` module:
 * ``pyre2`` and Python's ``re`` behave differently with nested and empty groups;
   ``pyre2`` will return an empty string in cases where Python would return None
   for a group that did not participate in a match.
-* Any bytestrings with invalid UTF-8 or other non-ASCII data may behave
-  differently.
 
 Please report any further issues with ``pyre2``.
 
@@ -162,5 +167,5 @@ and Facebook for the initial inspiration. Plus, I got to
 gut this readme file!
 
 Moreover, this library would of course not be possible if not for
-the immense work of the team at RE2 and the few people who work
+the immense work of the team at ``RE2`` and the few people who work
 on Cython.
diff --git a/setup.py b/setup.py
@@ -29,8 +29,11 @@ def run(self):
 
 def version_compare(version1, version2):
     def normalize(v):
-        return [int(x) for x in re.sub(r'(\.0+)*$','', v).split(".")]
-    return cmp(normalize(version1), normalize(version2))
+        return [int(x) for x in re.sub(r'(\.0+)*$', '', v).split(".")]
+    try:
+        return cmp(normalize(version1), normalize(version2))
+    except ValueError:  # raised by e.g. '0.24b0'
+        return 1
 
 cmdclass = {'test': TestCommand}
 

diff --git a/src/_re2macros.h b/src/_re2macros.h
@@ -9,16 +9,5 @@ static inline re2::StringPiece * new_StringPiece_array(int n)
     re2::StringPiece * sp = new re2::StringPiece[n];
     return sp;
 }
-static inline void delete_StringPiece_array(re2::StringPiece* ptr)
-{
-    delete[] ptr;
-}
-
-#define addressof(A) (&A)
-#define addressofs(A) (&A)
-
-#define as_char(A) (char *)(A)
-#define pattern_Replace(A, B, C) re2::RE2::Replace((A), (B), (C))
-#define pattern_GlobalReplace(A, B, C) re2::RE2::GlobalReplace((A), (B), (C))
 
 #endif
diff --git a/src/compile.pxi b/src/compile.pxi
@@ -15,7 +15,6 @@ def _compile(object pattern, int flags=0, int max_mem=8388608):
     """Compile a regular expression pattern, returning a pattern object."""
     def fallback(pattern, flags, error_msg):
         """Raise error, warn, or simply return fallback from re module."""
-        error_msg = "re.LOCALE not supported"
         if current_notification == FALLBACK_EXCEPTION:
             raise RegexError(error_msg)
         elif current_notification == FALLBACK_WARNING:
@@ -26,8 +25,8 @@ def _compile(object pattern, int flags=0, int max_mem=8388608):
             raise RegexError(*err.args)
         return result
 
-    cdef _re2.StringPiece * s
-    cdef _re2.Options opts
+    cdef StringPiece * s
+    cdef Options opts
     cdef int error_code
     cdef int encoded = 0
     cdef object original_pattern
@@ -44,13 +43,13 @@ def _compile(object pattern, int flags=0, int max_mem=8388608):
     pattern = unicode_to_bytes(pattern, &encoded, -1)
     newflags = flags
     if not PY2:
-        if not encoded and flags & _U:
-            pass
+        if not encoded and flags & _U:  # re.UNICODE
+            pass  # can use UNICODE with bytes pattern, but assumes valid UTF-8
             # raise ValueError("can't use UNICODE flag with a bytes pattern")
         elif encoded and not (flags & re.ASCII):
-            newflags = flags | re.UNICODE
+            newflags = flags | _U  # re.UNICODE
         elif encoded and flags & re.ASCII:
-            newflags = flags & ~re.UNICODE
+            newflags = flags & ~_U  # re.UNICODE
     try:
         pattern = _prepare_pattern(pattern, newflags)
     except BackreferencesException:
@@ -59,22 +58,23 @@ def _compile(object pattern, int flags=0, int max_mem=8388608):
         return fallback(original_pattern, flags,
                 "\W and \S not supported inside character classes")
 
-
     # Set the options given the flags above.
     if flags & _I:
         opts.set_case_sensitive(0);
 
     opts.set_max_mem(max_mem)
     opts.set_log_errors(0)
-    opts.set_encoding(_re2.EncodingUTF8)
+    if flags & _U or encoded:
+        opts.set_encoding(EncodingUTF8)
+    else:  # re.UNICODE flag not passed, and pattern is bytes,
+        # so allow matching of arbitrary byte sequences.
+        opts.set_encoding(EncodingLatin1)
 
-    s = new _re2.StringPiece(<char *><bytes>pattern, len(pattern))
+    s = new StringPiece(<char *><bytes>pattern, len(pattern))
 
-    cdef _re2.RE2 *re_pattern
-    cdef _re2.const_stringintmap * named_groups
-    cdef _re2.stringintmapiterator it
+    cdef RE2 *re_pattern
     with nogil:
-         re_pattern = new _re2.RE2(s[0], opts)
+         re_pattern = new RE2(s[0], opts)
 
     if not re_pattern.ok():
         # Something went wrong with the compilation.
@@ -85,9 +85,9 @@ def _compile(object pattern, int flags=0, int max_mem=8388608):
         if current_notification == FALLBACK_EXCEPTION:
             # Raise an exception regardless of the type of error.
             raise RegexError(error_msg)
-        elif error_code not in (_re2.ErrorBadPerlOp, _re2.ErrorRepeatSize,
-                # _re2.ErrorBadEscape,
-                _re2.ErrorPatternTooLarge):
+        elif error_code not in (ErrorBadPerlOp, ErrorRepeatSize,
+                # ErrorBadEscape,
+                ErrorPatternTooLarge):
             # Raise an error because these will not be fixed by using the
             # ``re`` module.
             raise RegexError(error_msg)
@@ -96,24 +96,20 @@ def _compile(object pattern, int flags=0, int max_mem=8388608):
         return re.compile(original_pattern, flags)
 
     cdef Pattern pypattern = Pattern()
+    cdef map[cpp_string, int] named_groups = re_pattern.NamedCapturingGroups()
     pypattern.pattern = original_pattern
     pypattern.re_pattern = re_pattern
     pypattern.groups = re_pattern.NumberOfCapturingGroups()
     pypattern.encoded = encoded
     pypattern.flags = flags
     pypattern.groupindex = {}
-    named_groups = _re2.addressof(re_pattern.NamedCapturingGroups())
-    it = named_groups.begin()
-    while it != named_groups.end():
+    for it in named_groups:
         if encoded:
-            pypattern.groupindex[cpp_to_unicode(deref(it).first)
-                    ] = deref(it).second
+            pypattern.groupindex[cpp_to_unicode(it.first)] = it.second
         else:
-            pypattern.groupindex[cpp_to_bytes(deref(it).first)
-                    ] = deref(it).second
-        inc(it)
+            pypattern.groupindex[cpp_to_bytes(it.first)] = it.second
 
-    if flags & re.DEBUG:
+    if flags & DEBUG:
         print(repr(pypattern._dump_pattern()))
     del s
     return pypattern