A few tests #7

austrin · 2018-11-16T00:35:05Z

The first commit of this PR contains two tests where the antlr rewrite (#3) currently appears wrong, failing to run scripts with e.g. \( escapes in regexes, and scripts using string comparisons. These two you should want.

The second commit contains a few more interesting tests that you might not necessarily want, but they exhibit some discrepancies between past, current, and future versions of checktestdata, so think of them as highlighting things you might want to specify better:

testprogducktype.in: this sets variable a first to an integer, and then starts assigning a[i] as if it was an array. In current checktestdata this works fine, but in the antlr-rewrite this gives an error. Would be good to specify expected behaviour for this
testdataloopvar.in: this uses a loop iteration variable after the loop, in particular a WHILEI variable. In the current checktestdata, after the loop the value of the loop variable equals the number of iterations that were run, but in the antlr-rewrite it is one less. Would be good to specify expected behaviour for this.
testdatarecharclass.in: uses the common \s, \d, etc shorthands for character classes in regexes. This used to work in some older version of checktestdata, but does not work in the current version or in the antlr-rewrite. Would be good to specify exact regex dialect that is allowed in checktestdata.
testdatarerepetitions.in: uses an RE with a somewhat large number of repetitions [a-z]{1,1500}. In current checktestdata this works, in antl-rewrite this throws an exception. It is fair that max number of repetitions depends on implementation I guess, but a bit disappointing that the bound is so low in the rewrite.
testprogfloatproc.in: ok this is not particularly interesting, but it seems the antlr-rewrite uses a different floating point type than the current version -- with current checktestdata adding 0.004 to itself 250 times results in a value <= 1.0, but in the antlr-rewrite the floating point errors go the other way and one gets a value slightly larger than 1.0. This is probably not actionable in any way, just wanted to point it out.

These were found by running the two checktestdata versions on all checktestdata scripts for problems on Kattis (there were around 700 or so of them), so while some of them may sound a bit contrived, all of them did in some sense occur in practice.

Let me know if you think any of these are interesting enough to keep, and then I can drop the rest from the PR.

eldering · 2018-11-16T08:44:33Z

Thanks, this looks very useful to me.
About specifying checktestdata behaviour better on the latter cases:

Assigning variables: I'd prefer to be flexible in variable assignments and treat "a" and "a[0]" as just the same variable, but with different array indices, the first being an empty set. The checktestdata program could output a warning if it encounters this.
Final loop variable value: I'd prefer it to be one beyond the last, since this behaviour matches what you'd expect from a loop variable in most languages. We could also leave it undefined, since nothing can break the loop, but if we'd introduce a "break" statement at any point, then the one beyond last behaviour makes a break in the last iteration distinghuishable from a completed loop.
Regex dialect: agreed that we should choose one, any common sounds fine to me, I'd opt for extented POSIX regex.
Regex limits: I'm not sure what the limits are currently (if any), but indeed 1500 is too low. Something like 10^5 sounds more reasonable, but unlimited strongly preferred.
Float precision: we can't expect a well-defined result here (although IEEE might specify one), but I think we should at least define a sharp range of allowed results. I would prefer to keep that well below the usual double precision limit.

austrin · 2018-11-16T12:58:39Z

Final loop variable value: I'd prefer it to be one beyond the last, since this behaviour matches what you'd expect from a loop variable in most languages. We could also leave it undefined, since nothing can break the loop, but if we'd introduce a "break" statement at any point, then the one beyond last behaviour makes a break in the last iteration distinghuishable from a completed loop.

Agree with your preference, because that seems most useful (but not sure I agree that it matches what happens in most languages -- in languages such as C++/Java this would be an error because the loop variable is not alive after the loop, and in Python a loop such as for i in range(n) will leave i at n-1after the loop).

Regex dialect: agreed that we should choose one, any common sounds fine to me, I'd opt for extented POSIX regex.

Do you know a good clear reference of what exactly that includes? For instance, when I search for it, one of the top hits is the boost documentation (https://www.boost.org/doc/libs/1_50_0/libs/regex/doc/html/boost_regex/syntax/basic_extended.html) -- that one claims to describe ERE's and does include stuff like \d instead of [[:digit:]], but as far as I understand that is strictly speaking not included in ERE's (since it is not mentioned here http://pubs.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap09.html).

Regex limits: I'm not sure what the limits are currently (if any), but indeed 1500 is too low. Something like 10^5 sounds more reasonable, but unlimited strongly preferred.

With the current(*) version the limit seems to be somewhere around 50k: [a-z]{1,49000} works, but [a-z]{1,50000} results in an exception thrown with error message "Number of NFA states exceeds limit. Please use shorter regex string, or use smaller brace expression, or make _GLIBCXX_REGEX_STATE_LIMIT larger." (so it sounds like it is easy to reconfigure).

(*) throughout this thread when I say "current", I really mean the 2018-04-22 release version.

TPolzer · 2018-11-16T17:30:22Z

Thanks indeed, let me add my 2¢ on those points too:

Array dimensions: Making dimensionality a checked property of variables was intentional. I would think that using the same name as an array and as an integer is either a mistake or at least confusing.
Agreed on the loop behavior, the counter should probably be equal to the number of iterations afterwards.
Regex compatibility is a minefield. I chose RE2 for deterministic performance and a nice API. Switching on Perl character classes is possible as is "posix syntax" (I don't know what extended posix is specifically, but I don't think RE2 supports it).
Tuning the memory limit per regex should be trivial although performance of a multi megabyte (default limit is ~240KB) automaton might be worse than if you spell out your check imperatively.
The lower precision on floats than the current version is intentional, because in my opinion there will very rarely be a legitimate need for that kind of precision and it's not worth pessimizing every single floating point operation for that edge case.

In summary, I would discard testprogfloatproc.in, I'm not sure about testdatarecharclass.in and I would add testprogducktype.in as a fail case.

meisterT · 2018-11-16T19:07:35Z

I agree with Tobias on testprogducktype.in and would be rather strict there.

Loop counter should be incremented after the last iteration.

TPolzer · 2018-11-16T22:13:12Z

Correction on the regex size, re2 docs (https://github.com/google/re2/wiki/Syntax) state:
"Implementation restriction: The counting forms x{n,m}, x{n,}, and x{n} reject forms that create a minimum or maximum repetition count above 1000. Unlimited repetitions are not subject to this restriction."

Single test case taken from #7. Thanks Per.

…tions. As pointed out by Per Austrin, the value of the loop variable after the loop was undefined. Define it to be the number of iterations, i.e. one beyond the last iteration value. Also add a modified test case provided by Per in #7.

eldering · 2018-11-19T23:20:13Z

Looking up basic/extended regex at http://pubs.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap09.html and https://www.regular-expressions.info/posix.html, the main difference seems to be that in BRE some characters (){} need to be escaped to have their special meaning, and +?| are not supported, while in ERE they are and need not be escaped for their special meaning.
Also looking at https://www.regular-expressions.info/stdregex.html, indeed there's no single simple choice, but either ERE or ECMAscript seem the most sensible choices to me.

…tions. As pointed out by Per Austrin, the value of the loop variable after the loop was undefined. Define it to be the number of iterations, i.e. one beyond the last iteration value. Also add a modified test case provided by Per in #7. (cherry picked from commit cd0d7ae)

austrin added 2 commits November 16, 2018 01:07

add two tests (escapes of regex operators, and string comparisons)

2748007

add various tests where correct behaviour is currently not well-defined

190d2af

austrin mentioned this pull request Nov 16, 2018

Improve performance #3

Open

TPolzer added a commit that referenced this pull request Nov 16, 2018

Merge loopvar test from #7 and fix it.

ded494f

Loop counter should be incremented after the last iteration.

TPolzer added a commit that referenced this pull request Nov 16, 2018

Add ducktype test from #7 as a failure case.

05ee0e0

TPolzer added a commit that referenced this pull request Nov 16, 2018

Add string comparison functionality and test.

61f54c2

Single test case taken from #7. Thanks Per.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

A few tests #7

A few tests #7

Uh oh!

austrin commented Nov 16, 2018

Uh oh!

eldering commented Nov 16, 2018

Uh oh!

austrin commented Nov 16, 2018

Uh oh!

TPolzer commented Nov 16, 2018

Uh oh!

meisterT commented Nov 16, 2018

Uh oh!

TPolzer commented Nov 16, 2018

Uh oh!

eldering commented Nov 19, 2018

Uh oh!

Uh oh!

A few tests #7

Are you sure you want to change the base?

A few tests #7

Uh oh!

Conversation

austrin commented Nov 16, 2018

Uh oh!

eldering commented Nov 16, 2018

Uh oh!

austrin commented Nov 16, 2018

Uh oh!

TPolzer commented Nov 16, 2018

Uh oh!

meisterT commented Nov 16, 2018

Uh oh!

TPolzer commented Nov 16, 2018

Uh oh!

eldering commented Nov 19, 2018

Uh oh!

Uh oh!