Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A few tests #7

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

A few tests #7

wants to merge 2 commits into from

Conversation

austrin
Copy link
Contributor

@austrin austrin commented Nov 16, 2018

The first commit of this PR contains two tests where the antlr rewrite (#3) currently appears wrong, failing to run scripts with e.g. \( escapes in regexes, and scripts using string comparisons. These two you should want.

The second commit contains a few more interesting tests that you might not necessarily want, but they exhibit some discrepancies between past, current, and future versions of checktestdata, so think of them as highlighting things you might want to specify better:

  • testprogducktype.in: this sets variable a first to an integer, and then starts assigning a[i] as if it was an array. In current checktestdata this works fine, but in the antlr-rewrite this gives an error. Would be good to specify expected behaviour for this

  • testdataloopvar.in: this uses a loop iteration variable after the loop, in particular a WHILEI variable. In the current checktestdata, after the loop the value of the loop variable equals the number of iterations that were run, but in the antlr-rewrite it is one less. Would be good to specify expected behaviour for this.

  • testdatarecharclass.in: uses the common \s, \d, etc shorthands for character classes in regexes. This used to work in some older version of checktestdata, but does not work in the current version or in the antlr-rewrite. Would be good to specify exact regex dialect that is allowed in checktestdata.

  • testdatarerepetitions.in: uses an RE with a somewhat large number of repetitions [a-z]{1,1500}. In current checktestdata this works, in antl-rewrite this throws an exception. It is fair that max number of repetitions depends on implementation I guess, but a bit disappointing that the bound is so low in the rewrite.

  • testprogfloatproc.in: ok this is not particularly interesting, but it seems the antlr-rewrite uses a different floating point type than the current version -- with current checktestdata adding 0.004 to itself 250 times results in a value <= 1.0, but in the antlr-rewrite the floating point errors go the other way and one gets a value slightly larger than 1.0. This is probably not actionable in any way, just wanted to point it out.

These were found by running the two checktestdata versions on all checktestdata scripts for problems on Kattis (there were around 700 or so of them), so while some of them may sound a bit contrived, all of them did in some sense occur in practice.

Let me know if you think any of these are interesting enough to keep, and then I can drop the rest from the PR.

@eldering
Copy link
Member

Thanks, this looks very useful to me.
About specifying checktestdata behaviour better on the latter cases:

  • Assigning variables: I'd prefer to be flexible in variable assignments and treat "a" and "a[0]" as just the same variable, but with different array indices, the first being an empty set. The checktestdata program could output a warning if it encounters this.
  • Final loop variable value: I'd prefer it to be one beyond the last, since this behaviour matches what you'd expect from a loop variable in most languages. We could also leave it undefined, since nothing can break the loop, but if we'd introduce a "break" statement at any point, then the one beyond last behaviour makes a break in the last iteration distinghuishable from a completed loop.
  • Regex dialect: agreed that we should choose one, any common sounds fine to me, I'd opt for extented POSIX regex.
  • Regex limits: I'm not sure what the limits are currently (if any), but indeed 1500 is too low. Something like 10^5 sounds more reasonable, but unlimited strongly preferred.
  • Float precision: we can't expect a well-defined result here (although IEEE might specify one), but I think we should at least define a sharp range of allowed results. I would prefer to keep that well below the usual double precision limit.

@austrin
Copy link
Contributor Author

austrin commented Nov 16, 2018

Final loop variable value: I'd prefer it to be one beyond the last, since this behaviour matches what you'd expect from a loop variable in most languages. We could also leave it undefined, since nothing can break the loop, but if we'd introduce a "break" statement at any point, then the one beyond last behaviour makes a break in the last iteration distinghuishable from a completed loop.

Agree with your preference, because that seems most useful (but not sure I agree that it matches what happens in most languages -- in languages such as C++/Java this would be an error because the loop variable is not alive after the loop, and in Python a loop such as for i in range(n) will leave i at n-1after the loop).

Regex dialect: agreed that we should choose one, any common sounds fine to me, I'd opt for extented POSIX regex.

Do you know a good clear reference of what exactly that includes? For instance, when I search for it, one of the top hits is the boost documentation (https://www.boost.org/doc/libs/1_50_0/libs/regex/doc/html/boost_regex/syntax/basic_extended.html) -- that one claims to describe ERE's and does include stuff like \d instead of [[:digit:]], but as far as I understand that is strictly speaking not included in ERE's (since it is not mentioned here http://pubs.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap09.html).

Regex limits: I'm not sure what the limits are currently (if any), but indeed 1500 is too low. Something like 10^5 sounds more reasonable, but unlimited strongly preferred.

With the current(*) version the limit seems to be somewhere around 50k: [a-z]{1,49000} works, but [a-z]{1,50000} results in an exception thrown with error message "Number of NFA states exceeds limit. Please use shorter regex string, or use smaller brace expression, or make _GLIBCXX_REGEX_STATE_LIMIT larger." (so it sounds like it is easy to reconfigure).

(*) throughout this thread when I say "current", I really mean the 2018-04-22 release version.

@TPolzer
Copy link
Contributor

TPolzer commented Nov 16, 2018

Thanks indeed, let me add my 2¢ on those points too:

  • Array dimensions: Making dimensionality a checked property of variables was intentional. I would think that using the same name as an array and as an integer is either a mistake or at least confusing.
  • Agreed on the loop behavior, the counter should probably be equal to the number of iterations afterwards.
  • Regex compatibility is a minefield. I chose RE2 for deterministic performance and a nice API. Switching on Perl character classes is possible as is "posix syntax" (I don't know what extended posix is specifically, but I don't think RE2 supports it).
  • Tuning the memory limit per regex should be trivial although performance of a multi megabyte (default limit is ~240KB) automaton might be worse than if you spell out your check imperatively.
  • The lower precision on floats than the current version is intentional, because in my opinion there will very rarely be a legitimate need for that kind of precision and it's not worth pessimizing every single floating point operation for that edge case.

In summary, I would discard testprogfloatproc.in, I'm not sure about testdatarecharclass.in and I would add testprogducktype.in as a fail case.

@austrin austrin mentioned this pull request Nov 16, 2018
@meisterT
Copy link
Member

I agree with Tobias on testprogducktype.in and would be rather strict there.

TPolzer added a commit that referenced this pull request Nov 16, 2018
Loop counter should be incremented after the last iteration.
@TPolzer
Copy link
Contributor

TPolzer commented Nov 16, 2018

Correction on the regex size, re2 docs (https://github.com/google/re2/wiki/Syntax) state:
"Implementation restriction: The counting forms x{n,m}, x{n,}, and x{n} reject forms that create a minimum or maximum repetition count above 1000. Unlimited repetitions are not subject to this restriction."

TPolzer added a commit that referenced this pull request Nov 16, 2018
TPolzer added a commit that referenced this pull request Nov 16, 2018
Single test case taken from #7. Thanks Per.
eldering added a commit that referenced this pull request Nov 19, 2018
…tions.

As pointed out by Per Austrin, the value of the loop variable after
the loop was undefined. Define it to be the number of iterations,
i.e. one beyond the last iteration value. Also add a modified test
case provided by Per in #7.
@eldering
Copy link
Member

Looking up basic/extended regex at http://pubs.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap09.html and https://www.regular-expressions.info/posix.html, the main difference seems to be that in BRE some characters (){} need to be escaped to have their special meaning, and +?| are not supported, while in ERE they are and need not be escaped for their special meaning.
Also looking at https://www.regular-expressions.info/stdregex.html, indeed there's no single simple choice, but either ERE or ECMAscript seem the most sensible choices to me.

meisterT pushed a commit that referenced this pull request Oct 21, 2019
…tions.

As pointed out by Per Austrin, the value of the loop variable after
the loop was undefined. Define it to be the number of iterations,
i.e. one beyond the last iteration value. Also add a modified test
case provided by Per in #7.

(cherry picked from commit cd0d7ae)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants