Different tokenisation results between oniguruma and RE2 #225

bzz · 2019-04-12T13:31:22Z

Right now envy uses regex-based tokeniser (until #218 at least).

We have 2 drop-in replacement regex engines: default from Go RE2 and default from Ruby oniguruma.

Recent improvements done with Go module migration #219 surfaced a new issue: it seems that tokeniser produces a bit different results, depending on which regex engine is used :/

More specifically, the token frequencies built from linguist samples are different and high-level code-generator test catch by comparing with a fixture (pre-generated with RE2) and fail on oniguruma profiles like this #219 (comment)

We need to find the exact reason and depending on it decide, if we want to support 2 versions of fixtures or change something so there is no difference in output.

This also potentially affects #194

creachadair · 2019-04-12T17:09:42Z

Often I have found the differences between engines on the same regexp boil down to small variations in the handling of greed (in particular: whether operators are greedy by default), zero-width assertions, anchoring (e.g., whether it is on by default), case-sensitivity, and other meta-properties of the execution model that aren't well-specified. Obviously I don't know what the causes are in this case yet, but that's where I would start looking for examples.

bzz · 2019-04-16T17:59:54Z

Steps to reproduce

$ go test ./internal/code-generator/... -run Test_GeneratorTestSuite -testify.m TestGenerationFiles
ok  	gopkg.in/src-d/enry.v1/internal/code-generator/generator	35.537s

$ go test -tags oniguruma  ./internal/code-generator/... -run Test_GeneratorTestSuite -testify.m TestGenerationFiles
FAIL	gopkg.in/src-d/enry.v1/internal/code-generator/generator	34.583s

Both should produce the same results if results of tokenisation of all linguist samples are the same.;
Going to add extra tests to verify just that.

bzz · 2019-04-17T08:44:14Z

Tracked the problem down to the difference in handling what seems to be latin1 encoded file.
While tokenising thå filling encoded in latin1 and read as UTF-8

RE2 gets

th
filling
�

Oniguruma gets

th
illing
�
f

flex-based tokenised from #218 gets

th
filling

creachadair · 2019-04-17T14:10:47Z

Of the three, only Flex really seems to be doing anything reasonable here. RE2 is close—but the order of the tokens is weird. I have no idea what Oniguruma is doing there, but it seems obviously broken in at least two ways.

bzz · 2019-04-17T16:44:52Z

Yup. And if the content is decoded from latin1 and encoded to utf8 with charmap.ISO8859_1.NewDecoder().Bytes(content) the result is an expected one:

RE2

th
filling
å

oniguruma

th
filling
å

That would not bother me much, if 2 months ago Linguist did not add such case to their samples, and that is what content classifier is trained on and it is something we keep a "gold standard" results on, as part of our test fixtures.

I know that Linguist does use ICU-based character encoding detector https://github.com/brianmario/charlock_holmes but am not sure yet if it's part of the tokenisation.

creachadair · 2019-04-17T16:52:24Z

Yeah, I think either we should normalize the encoding or find a way to treat the Unicode replacement character as part of the token, e.g., xxx�yyy ⇒ xxx�yyy, or use it as a separator and discard it, e.g. ⇒ xxx � yyy ⇒ xxx yyy. It seems like Linguist does the latter maybe.

bzz · 2019-04-17T17:03:34Z

True. And linguist with flex-based tokeniser does not have this issue, so no need to encoding detection there. Thank you for suggestions, let me think a bit more about that..

bzz · 2019-05-06T18:33:37Z

After digging deeper - ot seems that the offending tokenization rule is extractAndReplaceRegular(content) that does [0-9A-Za-z_\.@#\/\*]+.

The extractRemainders is then called on it's output and does `bytes.Fields(content),

in RE2 case it's " \xe5 " which results in extra �
in Oniguruma it's "\xe5 f"

bzz · 2019-05-06T21:01:49Z

For the record: doing equivalent operation in Ruby where regex are backed by Oniguruma lib results in ArgumentError: invalid byte sequence in UTF-8

$irb

"th\xdd filling".scan(/[0-9A-Za-z_\.@#\/\*]+/)

bzz · 2019-05-07T08:54:34Z

Digging a little bit deeper with oniguruma's C API using awesome examples, it starts to look like this may be a bug in go-oniguruma Regexp.findAll() implementation 🤔

bzz · 2019-05-07T11:38:08Z

Even in C, oniguruma does consistently produce strange result in UTF8 mode for the non-valid input bytes in UTF8, like above. Seems like a possible bug upstream.

As all regex for tokenization in enry are not using any Unicode character classes, all RE2-based matches are conducted in ASCII-only mode, while go-oniguruma has hardcoded UTF8.

For our use-case the fix would be to force Oniguruma also use ASCII mode and that indeed produces identical results even for non-valid bytes in UTF8.

I will submit a patch to cgo part of https://github.com/src-d/go-oniguruma to have an option to override hardcoded UTF8, expose it in Go as MustCompileWithEncoding (similar to MustCompileWithOption) and as soon as it's merged, move entry regex/oniguruma.go to use it in #227.

Meanwhile, only a better test case was added in there a724a2f

This is a workaround, motivated by the difference in handling non-valid UTF8 bytes that Oniriguma has, compared to Go's default RE2. See src-d/enry#225 (comment) Summary of changes: - c: prevent `NewOnigRegex()` from hard-coding UTF8 - c: `NewOnigRegex()` now propely calls to `onig_initialize()` [1] - go: expose new `MustCompileASCII()` \w default charecter class matching only ASCII - go: `MustCompile()` refactored, `initRegexp()` extracted for common UTF8/ASCII logic Encoding was not exposed on Go API level intentionaly for simplisity, in order to avoid introducing complex struct type [2] to API surface. 1. https://github.com/kkos/oniguruma/blob/83572e983928243d741f61ac290fc057d69fefc3/doc/API#L6 2. https://github.com/kkos/oniguruma/blob/83572e983928243d741f61ac290fc057d69fefc3/src/oniguruma.h#L121 Signed-off-by: Alexander Bezzubov <[email protected]>

* add ASCII-only option, to mimic default RE2 behaviour This is a workaround, motivated by the difference in handling non-valid UTF8 bytes that Oniriguma has, compared to Go's default RE2. See src-d/enry#225 (comment) Summary of changes: - c: prevent `NewOnigRegex()` from hard-coding UTF8 - c: `NewOnigRegex()` now propely calls to `onig_initialize()` [1] - go: expose new `MustCompileASCII()` \w default charecter class matching only ASCII - go: `MustCompile()` refactored, `initRegexp()` extracted for common UTF8/ASCII logic Encoding was not exposed on Go API level intentionaly for simplisity, in order to avoid introducing complex struct type [2] to API surface. 1. https://github.com/kkos/oniguruma/blob/83572e983928243d741f61ac290fc057d69fefc3/doc/API#L6 2. https://github.com/kkos/oniguruma/blob/83572e983928243d741f61ac290fc057d69fefc3/src/oniguruma.h#L121 Signed-off-by: Alexander Bezzubov <[email protected]> * ci: test on 2 latest go versions Signed-off-by: Alexander Bezzubov <[email protected]> * ci: bump version of Oniguruma to 6.9.1 Update deb to get fix https://bugs.launchpad.net/ubuntu/+source/dpkg/+bug/1730627 Signed-off-by: Alexander Bezzubov <[email protected]> * ci: refactor oniguruma installation Signed-off-by: Alexander Bezzubov <[email protected]> * refactoring go part a bit, addressing review feedback Signed-off-by: Alexander Bezzubov <[email protected]> * ci: fix typo in bash var substitution Signed-off-by: Alexander Bezzubov <[email protected]> * cgo: simplify naive encoding init Signed-off-by: Alexander Bezzubov <[email protected]> * go: doc syntax fix Signed-off-by: Alexander Bezzubov <[email protected]> * tixing fypos Signed-off-by: Alexander Bezzubov <[email protected]>

bzz · 2019-05-08T21:03:14Z

Fixed by #227

bzz added the bug label Apr 12, 2019

bzz mentioned this issue Apr 12, 2019

Introduce Go modules #219

Merged

8 tasks

bzz self-assigned this Apr 14, 2019

bzz added this to the v2.0.0 milestone Apr 14, 2019

bzz mentioned this issue May 7, 2019

Add ASCII-only option, to mimic default RE2 behavior src-d/go-oniguruma#1

Merged

bzz closed this as completed May 8, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Different tokenisation results between oniguruma and RE2 #225

Different tokenisation results between oniguruma and RE2 #225

bzz commented Apr 12, 2019

creachadair commented Apr 12, 2019 •

edited

Loading

bzz commented Apr 16, 2019

bzz commented Apr 17, 2019 •

edited

Loading

creachadair commented Apr 17, 2019

bzz commented Apr 17, 2019 •

edited

Loading

creachadair commented Apr 17, 2019

bzz commented Apr 17, 2019

bzz commented May 6, 2019 •

edited

Loading

bzz commented May 6, 2019 •

edited

Loading

bzz commented May 7, 2019 •

edited

Loading

bzz commented May 7, 2019 •

edited

Loading

bzz commented May 8, 2019

Different tokenisation results between oniguruma and RE2 #225

Different tokenisation results between oniguruma and RE2 #225

Comments

bzz commented Apr 12, 2019

creachadair commented Apr 12, 2019 • edited Loading

bzz commented Apr 16, 2019

bzz commented Apr 17, 2019 • edited Loading

creachadair commented Apr 17, 2019

bzz commented Apr 17, 2019 • edited Loading

creachadair commented Apr 17, 2019

bzz commented Apr 17, 2019

bzz commented May 6, 2019 • edited Loading

bzz commented May 6, 2019 • edited Loading

bzz commented May 7, 2019 • edited Loading

bzz commented May 7, 2019 • edited Loading

bzz commented May 8, 2019

creachadair commented Apr 12, 2019 •

edited

Loading

bzz commented Apr 17, 2019 •

edited

Loading

bzz commented Apr 17, 2019 •

edited

Loading

bzz commented May 6, 2019 •

edited

Loading

bzz commented May 6, 2019 •

edited

Loading

bzz commented May 7, 2019 •

edited

Loading

bzz commented May 7, 2019 •

edited

Loading