Fix PGN test by differentiating single and multigame PGN parsing #42

KMK-Git · 2024-10-03T04:03:54Z

This fixes #30 by adding another field to _PgnParser indicating whether it is parsing a single game PGN or multi game PGN. However I am marking this as draft because there is an untested scenario for multi game PGNs which will still fail.

The root cause is the following line which resets the parser and treats any empty/whitespace line as end of the game if the parser is in case _ParserState.moves.

dartchess/lib/src/pgn.dart

Line 779 in 65fad86

if (_isWhitespace(line)) return _emit();

Post the fix:

For single game PGNs, empty lines are ignored in case _ParserState.moves (new behaviour).
For multi game PGNs, empty lines are treated as end of the game in case _ParserState.moves (current behaviour). This is required for current multi game PGN test scenarios to be parsed correctly.

This will resolve the issue for single game PGNs tested by the test case, but will still continue failing for multigame PGNs. A simple example scenario which will still fail:

[Variant "Standard"] 
  
 1. e4 e5
 2. Nf3 Nc6

 3. Bb5
  
[Variant "Standard"] 1. e4 e5 2. Nf3 Nc6 3. Bb5 { comment }

This will be parsed as three separate games due to the empty line before 3. Bb5:

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "*"]
[Variant "Standard"]

1. e4 e5 2. Nf3 Nc6 *

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "*"]

1. Bb5 *

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "*"]
[Variant "Standard"]

1. e4 e5 2. Nf3 Nc6 3. Bb5 { comment } *

Note that currently there are no test cases for the multigame PGN scenario described here.

@veloce My questions for you:

Do we need to consider the possibility of empty whitespace lines in a multi game PGNs and find a solution to handle it as well?
If we need to ignore whitespace lines, what alternative can be used to separate games in a multigame PGN? Some example test scenarios don't have headers, so the only method I can think of is move numbers (In the above scenario 3. Bb5 is obviously not the start of a new game). However move numbers seem to be completely ignored in the current parsing logic, so this might require significant changes.

…arsed, handle linebreaks accordingly

veloce · 2024-10-03T06:49:49Z

Good question.

We cannot support every weird syntax or edge cases. Are whitespace lines common in PGN or not? I have no idea.
How tricky do you estimate the fix, in terms of complexity and added maintenance cost? If the fix is cheap I'd say let's do it, if not then we can ignore it (perhaps with a test case showing the parser purposefully behaves like that).

Sorry I can't answer your second question, I haven't thought about it. So maybe that answers my question above :)

KMK-Git · 2024-10-04T01:04:58Z

I looked at the lila repo and found something which might be relevant:

https://github.com/lichess-org/lila/blob/f458495d5145724bf49908d65aed2a294d6ed44d/modules/study/src/main/MultiPgn.scala#L11-L11

I have not worked with scala or the lila repo before this, but based on my understanding, when importing PGNs in studies the string is split on the regex \n\n(?=\[) (Empty line followed by [ indicating a tag for the next game). Each substring is then parsed as a separate game. This can also be replicated in this repository, simplifying the implementation for parsing a multi-game PGN string.

The main drawback of this solution will be that it will break an existing test case/scenario where there are no tags at the start of a game which is not the first game in a multi-game PGN string: https://github.com/lichess-org/dartchess/blob/main/data/leading-whitespace.pgn (See fourth game). However, this will be consistent with the current lila implementation. If you are still ok with this, I can have a go at implementing it.

If the above is unacceptable, the only alternative solution I can think of will require checking the move numbers. While possible to implement, it will be more complex than the above solution: the current move number will need to be tracked and a lookahead will need to be done for the next move number whenever an empty line is found.

veloce · 2024-10-05T07:51:16Z

It is fine by me, this extra test case can be ignored, it is not an important one. It really makes sense that multi game png are separated by headers.

We should also keep this test case but with the expected result changed to make it clear we don't support this by specification.

This reverts commit 80f5af5.

… being parsed, handle linebreaks accordingly" This reverts commit db66704.

KMK-Git · 2024-10-08T00:02:47Z

Updated the PR and marked it Ready for review

Fix for #30, based on approach discussed above:

Updated MultiGamePGN parser to split the PGN string on the regex \n\s+(?=\[). After the split, each individual string is parsed separately.
Since the regex now manages splitting the PGN into multiple games, doing if (_isWhitespace(line)) return _emit(); is no longer needed.
Uncommented crafty test case, and updated the expected result for leading whitespace test to match new behaviour.

While working on this, I also implemented a fix for #38. This fix is in the Parser but the root cause is different from #30:
To recreate: Start of comment { needs to be the last character on a line. The same line should also have an end of comment } at any point on the same line before the }. Placement of ( or ) does not make any difference.
Example:
sides, primarily for the one which had not expected this line. }) 9... Nxg4 (9... Bg6 10. Nxe5 {
The issue: When { is found before end of line, there is no call to substring and the parser continues processing the same line as a whole. The logic to find the rest of the comment string uses indexOf('}'), but since the parser is at the same line it finds the } character of the previous comment. The parser goes into an infinite loop getting stuck on the same line.
The fix: When a { is the final character on a line, change the parser state to comment and stop processing the line (move to the next line).
I also added the testcase from #38 to validate the fix. This fix is an extension of changes made in #31

veloce

Thanks for this contribution! Just need some clarification on some tests and we can merge this.

veloce · 2024-10-09T16:02:30Z

data/wcc_2023_eolcomment.pgn

Hard to see the difference with the existing wcc_2023.pgn. What is it exactly?

The new testcase is the file attached in #38. The main difference is on the line shown in the screenshot of that issue. The original file does not have a newline after the {.

veloce · 2024-10-09T16:11:17Z

test/pgn_test.dart

+      final game = PgnGame.parsePgn(data);
+      expect(
+        game.comments,
+        [


Not sure to understand what wasn't working before here. Parsing the wcc_2023 was working fine as far as I can tell.

Sorry if I didn't get right your explanation in the PR description. Can you try to explain quickly what kind of bug is tested here? (perhaps as a code comment in the test so future readers will know too).

Thanks!

You can find an explanation in the second part of the reply here: #42 (comment) I have also updated the test cases to include a description of the exact scenario being tested (#38).

Just to be clear and avoid any confusion, this PR is fixing both #30 and #38. Both issues are bugs in the PGN parser but in separate areas of the code.

veloce

Thanks for the explanation!

KMK-Git added 2 commits October 2, 2024 23:01

Add flag to indicate whether single game or multi game pgn is being p…

db66704

…arsed, handle linebreaks accordingly

Fix comment

80f5af5

KMK-Git added 5 commits October 7, 2024 00:39

Fix for comment at end of line

03cc1a8

Revert "Fix comment"

e97e820

This reverts commit 80f5af5.

Revert "Add flag to indicate whether single game or multi game pgn is…

331c24b

… being parsed, handle linebreaks accordingly" This reverts commit db66704.

Remove extra print statement

62c00af

Fix empty line handling by presplitting multigame PGN

66361cc

KMK-Git marked this pull request as ready for review October 8, 2024 00:02

veloce reviewed Oct 9, 2024

View reviewed changes

KMK-Git added 2 commits October 14, 2024 22:32

Add a comment to describe new testcase

4d1aa8f

Attach comment to correct testcases

b9d3ec7

veloce approved these changes Oct 16, 2024

View reviewed changes

veloce merged commit c8c13d0 into lichess-org:main Oct 16, 2024
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix PGN test by differentiating single and multigame PGN parsing #42

Fix PGN test by differentiating single and multigame PGN parsing #42

KMK-Git commented Oct 3, 2024

veloce commented Oct 3, 2024

KMK-Git commented Oct 4, 2024

veloce commented Oct 5, 2024

KMK-Git commented Oct 8, 2024

veloce left a comment

veloce Oct 9, 2024

KMK-Git Oct 15, 2024

veloce Oct 9, 2024

KMK-Git Oct 15, 2024

veloce left a comment

Fix PGN test by differentiating single and multigame PGN parsing #42

Fix PGN test by differentiating single and multigame PGN parsing #42

Conversation

KMK-Git commented Oct 3, 2024

veloce commented Oct 3, 2024

KMK-Git commented Oct 4, 2024

veloce commented Oct 5, 2024

KMK-Git commented Oct 8, 2024

veloce left a comment

Choose a reason for hiding this comment

veloce Oct 9, 2024

Choose a reason for hiding this comment

KMK-Git Oct 15, 2024

Choose a reason for hiding this comment

veloce Oct 9, 2024

Choose a reason for hiding this comment

KMK-Git Oct 15, 2024

Choose a reason for hiding this comment

veloce left a comment

Choose a reason for hiding this comment