Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix PGN test by differentiating single and multigame PGN parsing #42

Merged
merged 9 commits into from
Oct 16, 2024

Conversation

KMK-Git
Copy link
Contributor

@KMK-Git KMK-Git commented Oct 3, 2024

This fixes #30 by adding another field to _PgnParser indicating whether it is parsing a single game PGN or multi game PGN. However I am marking this as draft because there is an untested scenario for multi game PGNs which will still fail.

The root cause is the following line which resets the parser and treats any empty/whitespace line as end of the game if the parser is in case _ParserState.moves.

if (_isWhitespace(line)) return _emit();

Post the fix:

  • For single game PGNs, empty lines are ignored in case _ParserState.moves (new behaviour).
  • For multi game PGNs, empty lines are treated as end of the game in case _ParserState.moves (current behaviour). This is required for current multi game PGN test scenarios to be parsed correctly.

This will resolve the issue for single game PGNs tested by the test case, but will still continue failing for multigame PGNs. A simple example scenario which will still fail:

[Variant "Standard"] 
  
 1. e4 e5
 2. Nf3 Nc6

 3. Bb5
  
[Variant "Standard"] 1. e4 e5 2. Nf3 Nc6 3. Bb5 { comment }  

This will be parsed as three separate games due to the empty line before 3. Bb5:

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "*"]
[Variant "Standard"]

1. e4 e5 2. Nf3 Nc6 *

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "*"]

1. Bb5 *

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "*"]
[Variant "Standard"]

1. e4 e5 2. Nf3 Nc6 3. Bb5 { comment } *

Note that currently there are no test cases for the multigame PGN scenario described here.

@veloce My questions for you:

  1. Do we need to consider the possibility of empty whitespace lines in a multi game PGNs and find a solution to handle it as well?
  2. If we need to ignore whitespace lines, what alternative can be used to separate games in a multigame PGN? Some example test scenarios don't have headers, so the only method I can think of is move numbers (In the above scenario 3. Bb5 is obviously not the start of a new game). However move numbers seem to be completely ignored in the current parsing logic, so this might require significant changes.

@veloce
Copy link
Collaborator

veloce commented Oct 3, 2024

Good question.

We cannot support every weird syntax or edge cases. Are whitespace lines common in PGN or not? I have no idea.
How tricky do you estimate the fix, in terms of complexity and added maintenance cost? If the fix is cheap I'd say let's do it, if not then we can ignore it (perhaps with a test case showing the parser purposefully behaves like that).

Sorry I can't answer your second question, I haven't thought about it. So maybe that answers my question above :)

@KMK-Git
Copy link
Contributor Author

KMK-Git commented Oct 4, 2024

I looked at the lila repo and found something which might be relevant:

https://github.com/lichess-org/lila/blob/f458495d5145724bf49908d65aed2a294d6ed44d/modules/study/src/main/MultiPgn.scala#L11-L11

I have not worked with scala or the lila repo before this, but based on my understanding, when importing PGNs in studies the string is split on the regex \n\n(?=\[) (Empty line followed by [ indicating a tag for the next game). Each substring is then parsed as a separate game. This can also be replicated in this repository, simplifying the implementation for parsing a multi-game PGN string.

The main drawback of this solution will be that it will break an existing test case/scenario where there are no tags at the start of a game which is not the first game in a multi-game PGN string: https://github.com/lichess-org/dartchess/blob/main/data/leading-whitespace.pgn (See fourth game). However, this will be consistent with the current lila implementation. If you are still ok with this, I can have a go at implementing it.


If the above is unacceptable, the only alternative solution I can think of will require checking the move numbers. While possible to implement, it will be more complex than the above solution: the current move number will need to be tracked and a lookahead will need to be done for the next move number whenever an empty line is found.

@veloce
Copy link
Collaborator

veloce commented Oct 5, 2024

It is fine by me, this extra test case can be ignored, it is not an important one. It really makes sense that multi game png are separated by headers.

We should also keep this test case but with the expected result changed to make it clear we don't support this by specification.

@KMK-Git
Copy link
Contributor Author

KMK-Git commented Oct 8, 2024

Updated the PR and marked it Ready for review

Fix for #30, based on approach discussed above:

  • Updated MultiGamePGN parser to split the PGN string on the regex \n\s+(?=\[). After the split, each individual string is parsed separately.
  • Since the regex now manages splitting the PGN into multiple games, doing if (_isWhitespace(line)) return _emit(); is no longer needed.
  • Uncommented crafty test case, and updated the expected result for leading whitespace test to match new behaviour.

While working on this, I also implemented a fix for #38. This fix is in the Parser but the root cause is different from #30:
To recreate: Start of comment { needs to be the last character on a line. The same line should also have an end of comment } at any point on the same line before the }. Placement of ( or ) does not make any difference.
Example:
sides, primarily for the one which had not expected this line. }) 9... Nxg4 (9... Bg6 10. Nxe5 {
The issue: When { is found before end of line, there is no call to substring and the parser continues processing the same line as a whole. The logic to find the rest of the comment string uses indexOf('}'), but since the parser is at the same line it finds the } character of the previous comment. The parser goes into an infinite loop getting stuck on the same line.
The fix: When a { is the final character on a line, change the parser state to comment and stop processing the line (move to the next line).
I also added the testcase from #38 to validate the fix. This fix is an extension of changes made in #31

@KMK-Git KMK-Git marked this pull request as ready for review October 8, 2024 00:02
Copy link
Collaborator

@veloce veloce left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this contribution! Just need some clarification on some tests and we can merge this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard to see the difference with the existing wcc_2023.pgn. What is it exactly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new testcase is the file attached in #38. The main difference is on the line shown in the screenshot of that issue. The original file does not have a newline after the {.

final game = PgnGame.parsePgn(data);
expect(
game.comments,
[
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure to understand what wasn't working before here. Parsing the wcc_2023 was working fine as far as I can tell.

Sorry if I didn't get right your explanation in the PR description. Can you try to explain quickly what kind of bug is tested here? (perhaps as a code comment in the test so future readers will know too).

Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can find an explanation in the second part of the reply here: #42 (comment) I have also updated the test cases to include a description of the exact scenario being tested (#38).

Just to be clear and avoid any confusion, this PR is fixing both #30 and #38. Both issues are bugs in the PGN parser but in separate areas of the code.

Copy link
Collaborator

@veloce veloce left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation!

@veloce veloce merged commit c8c13d0 into lichess-org:main Oct 16, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve PGN parser
2 participants