Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate alignment deuterocanonical books #284

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

mudiagaobrikisil
Copy link
Collaborator

@mudiagaobrikisil mudiagaobrikisil commented Jan 22, 2025

Contains new tests to check for possible issues with alignment, mapping and versification of deuterocanonical books.
The tests are as follows:

  • GetVersesInVersification_ButNotInSourceOrTarget: Check the vrs files for verses contained that are not contained in SFM files. Doesn't fail if such verses are found. It just outputs the verses and continues gracefully.
  • GetVersesInSourceOrTarget_ButNotInVersification: Checks the SFM files for verses contained but not captured by the vrs files. Doesn't fail if such verses are found. It just outputs the verses and continues gracefully.
  • ValidateCrossBookMappingsAcrossVersifications: Test to ensure that cross book mappings specified in the versification files work e.g. SUS 1:1-63" mapped to "DAG 13:1-63".
  • GetDoubleMappingsAcrossVersifications: Test to get cases of double mappings. Doesn't fail if found. Just outputs it and continues gracefully.
  • EnsureFirstAndLastRowExist: Checks the first and last rows exist in the SFM files.
  • GetRows_DeuterocanonicalBooksFullCoverage: This test verifies that deuterocanonical books are correctly represented and aligned in a parallel corpus
  • GetRows_AllDeuterocanonicalBooks_WithAlignments: This test verifies the parallel corpus functionality for all deuterocanonical books, while also checking the alignment data between source and target texts. It ensures the correct creation and alignment of text rows and word pairs.
  • ValidateReferencesWithAllVersifications: This test validates the handling of scripture references across different versifications in a parallel corpus. The test ensures that the source and target corpora, as well as their alignments, maintain consistency and accuracy for the specified versification, book, and verse reference.
  • GetRows_MultipleRowsPerBookWithMismatches: This test is designed to validate the behavior of a parallel text corpus when there are mismatches between the source and target corpora. It examines how the system handles such mismatches and reports any discrepancies in scripture references, segments, or alignments.
  • ValidateSourceAndTargetReferencesWithBackup: This test validates that source and target scripture references in a parallel text corpus are correctly aligned for specific books and verses. It ensures that the source and target corpora, when combined with an alignment corpus, produce consistent and expected results for references.

It also contains some helper methods.


This change is Reviewable

@codecov-commenter
Copy link

codecov-commenter commented Jan 22, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.32%. Comparing base (3a9b17e) to head (6c457a9).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #284      +/-   ##
==========================================
+ Coverage   70.28%   70.32%   +0.04%     
==========================================
  Files         385      386       +1     
  Lines       32019    32036      +17     
  Branches     4504     4507       +3     
==========================================
+ Hits        22503    22529      +26     
+ Misses       8471     8463       -8     
+ Partials     1045     1044       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a quick scan of the PR looking more for coding issues, but I hope to look at the Google Doc and look over the issue and see if I have any big picture feedback. Thanks, Mudi!

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @mudiagaobrikisil)


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1273 at r1 (raw file):

        };

        Versification.Table.Implementation.RemoveAllUnknownVersifications();

Why is this line needed?


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1388 at r1 (raw file):

        if (!string.IsNullOrEmpty(mismatchReport))
        {
            TestContext.WriteLine("Mismatches found:");

Shouldn't this test fail if there are mismatches? Could we pass the mismatchReport variable as the message in the exception that's being thrown?


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1411 at r1 (raw file):

    public void ValidateSourceAndTargetReferencesWithBackup(string bookId, string verseRef)
    {
        string sourceCorpusFile = CorporaTestHelpers.TestDataPath + "/Source - LAT.zip";

It's in general not a good idea to commit the binary zip files. We've been moving toward checking in the unzipped Paratext projects. You can just use ParatextTextCorpus instead of ParatextBackUpTextCorpus.


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1427 at r1 (raw file):

        );

        ParallelTextCorpus parallelCorpus = new ParallelTextCorpus(sourceCorpus, targetCorpus, alignments)

Why not use the AlignRows function?


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1512 at r1 (raw file):

    public void ValidateReferencesWithAllVersifications(ScrVers versification, string bookId, string verseRef)
    {
        string sourceCorpusFile = CorporaTestHelpers.TestDataPath + "/Source - LAT.zip";

Maybe some of this could be moved to a SetUp function if it's consistent across all the tests?


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1596 at r1 (raw file):

            AllTargetRows = true
        };
        ParallelTextRow[] rows = parallelCorpus.ToArray();

You should be able to just do parallelCorpus.GetRows(new string[] {bookId}) instead of using the Where statement.


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1710 at r1 (raw file):

        ScrVers versification = versificationType switch
        {
            ScrVersType.Original => ScrVers.Original,

I think you have a helper function below that does this, right?


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1800 at r1 (raw file):

        Dictionary<string, string> expectedMappings = new Dictionary<string, string>
        {
            // { "S3Y 1:1-29", "DAG 3:24-52" },

Is there a reason to leave these commented out rows in?


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1897 at r1 (raw file):

                        // Clean and normalize content for comparison
                        string[] unwanted = { "÷" };

Why remove these before comparison? Aren't the source and target the same?


tests/SIL.Machine.Tests/Corpora/TestData/Source - LAT.zip line 0 at r1 (raw file):
❗ If this is the source text I sent you (and not the one Peter offered to construct), it's not public domain and all record of it needs to be removed from the repo.

Copy link
Collaborator Author

@mudiagaobrikisil mudiagaobrikisil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @Enkidu93)


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1411 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

It's in general not a good idea to commit the binary zip files. We've been moving toward checking in the unzipped Paratext projects. You can just use ParatextTextCorpus instead of ParatextBackUpTextCorpus.

Noted. Will need a little help on that.


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1512 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Maybe some of this could be moved to a SetUp function if it's consistent across all the tests?

Noted. Will refactor


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1596 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

You should be able to just do parallelCorpus.GetRows(new string[] {bookId}) instead of using the Where statement.

Noted. Will refactor


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1710 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I think you have a helper function below that does this, right?

Yes. Will refactor


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1800 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Is there a reason to leave these commented out rows in?

Yes. They are also mappings found in the vrs file. Well, I will take them out. One mapping is sufficient for the test.


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1897 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Why remove these before comparison? Aren't the source and target the same?

No they are not the same due to the presence of that character in the target files


tests/SIL.Machine.Tests/Corpora/TestData/Source - LAT.zip line at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

❗ If this is the source text I sent you (and not the one Peter offered to construct), it's not public domain and all record of it needs to be removed from the repo.

Noted. Will do that

@Enkidu93
Copy link
Collaborator

tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1411 at r1 (raw file):

Previously, mudiagaobrikisil wrote…

Noted. Will need a little help on that.

You should be able to just extract the binary like a normal .zip file and put it in the repo. See here for an example of the unzipped project being used. If that doesn't clear it up, let me know & we can chat.

Copy link
Collaborator Author

@mudiagaobrikisil mudiagaobrikisil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @Enkidu93)


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1411 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

You should be able to just extract the binary like a normal .zip file and put it in the repo. See here for an example of the unzipped project being used. If that doesn't clear it up, let me know & we can chat.

I understand that part. However, since you said the files are not for public domain yet, that is why I need to help and some explanations

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 4 files reviewed, 11 unresolved discussions (waiting on @mudiagaobrikisil)


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1916 at r2 (raw file):

                Versification = versification
            };
            ParatextBackupTextCorpus targetCorpus = new ParatextBackupTextCorpus(sourceCorpusFile)

You use sourceCorpusFile for both sourceCorpus and targetCorpus, so shouldn't they be the same below?

Copy link
Collaborator Author

@mudiagaobrikisil mudiagaobrikisil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 4 files reviewed, 11 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1388 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Shouldn't this test fail if there are mismatches? Could we pass the mismatchReport variable as the message in the exception that's being thrown?

@johnml1135 suggested I log the issues out and let the tests pass gracefully, since we are not immediately addressing the issues and it would affect the CI/CD pipeline if the tests fail.


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1427 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Why not use the AlignRows function?

Done


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1916 at r2 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

You use sourceCorpusFile for both sourceCorpus and targetCorpus, so shouldn't they be the same below?

Refactored to use one Corpora

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r2, 30 of 30 files at r3, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @johnml1135 and @mudiagaobrikisil)


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1273 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Why is this line needed?

So why include this line?


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1388 at r1 (raw file):

Previously, mudiagaobrikisil wrote…

@johnml1135 suggested I log the issues out and let the tests pass gracefully, since we are not immediately addressing the issues and it would affect the CI/CD pipeline if the tests fail.

OK, sounds good


tests/SIL.Machine.Tests/Corpora/CorporaTestHelpers.cs line 81 at r3 (raw file):

    /// </summary>
    /// <returns>A tuple containing the source corpus (first) and target corpus (second).</returns>
    public static (ParatextTextCorpus sourceCorpus, ParatextTextCorpus targetCorpus) GetDeuterocanonicalCorpora()

Is this used/necessary?

@johnml1135
Copy link
Collaborator

tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1388 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

OK, sounds good

I may have misspoken - if there is a mismatch, we should fail and address the underlying issue. If it becomes too difficult, we can create a new issue to document the issue, but no one will ever look at the text output here. Please make the tests fail and if there are failures, we should understand the issue fully (and document in new issues) before knowledgeably removing tests.

Copy link
Collaborator Author

@mudiagaobrikisil mudiagaobrikisil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 29 of 35 files reviewed, 3 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


tests/SIL.Machine.Tests/Corpora/CorporaTestHelpers.cs line 81 at r3 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Is this used/necessary?

It is not necessary. It has been removed


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1273 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

So why include this line?

I think it's not needed. I have removed it.


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1388 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

I may have misspoken - if there is a mismatch, we should fail and address the underlying issue. If it becomes too difficult, we can create a new issue to document the issue, but no one will ever look at the text output here. Please make the tests fail and if there are failures, we should understand the issue fully (and document in new issues) before knowledgeably removing tests.

Actually the expected behavior of the test is that it should fail. The mismatch is deliberate and failing is the right response if there is a mismatch. I don't think there is an underlying issue here. There would be if the test does not catch the mismatch. However, if I let the test fail, since it is a test that should fail, would it not affect the CI/CD?

@johnml1135
Copy link
Collaborator

tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1388 at r1 (raw file):

Previously, mudiagaobrikisil wrote…

Actually the expected behavior of the test is that it should fail. The mismatch is deliberate and failing is the right response if there is a mismatch. I don't think there is an underlying issue here. There would be if the test does not catch the mismatch. However, if I let the test fail, since it is a test that should fail, would it not affect the CI/CD?

If a mismatch is the expected behavior, the test should only pass if there is a mismatch. Ideally, the main boundary matches and mismatches would be asserted in these tests.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 6 files at r4, all commit messages.
Reviewable status: 34 of 35 files reviewed, 1 unresolved discussion (waiting on @johnml1135 and @mudiagaobrikisil)


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1388 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

If a mismatch is the expected behavior, the test should only pass if there is a mismatch. Ideally, the main boundary matches and mismatches would be asserted in these tests.

^

Copy link
Collaborator Author

@mudiagaobrikisil mudiagaobrikisil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 34 of 35 files reviewed, 1 unresolved discussion (waiting on @Enkidu93)


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1388 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

^

@johnml1135 . Thank you. I have edited the test to ensure that it passes only if it catches the matches and the mismatches correctly.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 34 of 35 files reviewed, 1 unresolved discussion (waiting on @mudiagaobrikisil)


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1280 at r4 (raw file):

                "vers.txt",
                ScrVers.English,
                "customVersification"

Why change the name?

@johnml1135
Copy link
Collaborator

tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1388 at r1 (raw file):

Previously, mudiagaobrikisil wrote…

@johnml1135 . Thank you. I have edited the test to ensure that it passes only if it catches the matches and the mismatches correctly.

Great - I look forward to seeing the changes.

Copy link
Collaborator

@johnml1135 johnml1135 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 34 of 35 files reviewed, 2 unresolved discussions (waiting on @Enkidu93 and @mudiagaobrikisil)

investigate possible alignment and versification
issues with deuterocanonical books
@mudiagaobrikisil mudiagaobrikisil force-pushed the investigate_alignment_deuterocanonical_books branch from 83fc7a3 to e193846 Compare February 5, 2025 11:00
Copy link
Collaborator Author

@mudiagaobrikisil mudiagaobrikisil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 30 of 35 files reviewed, 2 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1388 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Great - I look forward to seeing the changes.

Changes pushed.


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1280 at r4 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Why change the name?

This was an old change just to debug what the error was. I have refactored it to handle the error properly. The error was that it was looping through the books I specified and then trying to load a versification with the same key.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed all commit messages.
Reviewable status: 30 of 35 files reviewed, 2 unresolved discussions (waiting on @johnml1135 and @mudiagaobrikisil)


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1280 at r4 (raw file):

Previously, mudiagaobrikisil wrote…

This was an old change just to debug what the error was. I have refactored it to handle the error properly. The error was that it was looping through the books I specified and then trying to load a versification with the same key.

OK, thank you 👍

@Enkidu93
Copy link
Collaborator

Enkidu93 commented Feb 5, 2025

Fixes sillsdev/serval#478

@johnml1135
Copy link
Collaborator

tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1388 at r1 (raw file):

Previously, mudiagaobrikisil wrote…

Changes pushed.

Thanks for the updates Mudi! As I look at them, I notice that you are just counting the number of mismatches, not verifying each proper match or mismatch. Can you verify each one in a way that makes it clear what each match or mismatch is intending to verify (either self-documented or with comments as needed)?

@johnml1135
Copy link
Collaborator

Eli - just for my sake, can you place a comment next to each test that verifies each "issue" from the word doc - that is, the ones that need to be verified: https://docs.google.com/document/d/1lV3Sa0IDxFkx8wcQ73xeBg8hs9A60OstzvqvOCOPw9o/edit?tab=t.0?

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. @johnml1135! Issue 1, I think is covered at a book level indirectly many places but more directly in GetVersesInVersification_ButNotInSourceOrTarget and Issue 2 is covered by GetVersesInSourceOrTarget_ButNotInVersification. Is that right, Mudi? Without digging into the specific USFM files, it's hard to tell. Could you verify that that test covers both with specific examples.

Issues 3 and 4 don't require any action in the code. Issue 5, I believe, is already addressed in existing tests outside this PR indirectly. Would you like a test specific to the deuterocanon related to this, @johnml1135?

Issue 6: I'm not sure that this is covered; I'm also not 100% sure which cases this is meant to cover. Are you concerned about multiple verses mapping to the same verse?

Sorry about all the new items to address - I seemed to have missed some of the changes the first time through / forgot to get back to them. Thanks, Mudi!

Reviewable status: 30 of 35 files reviewed, 11 unresolved discussions (waiting on @mudiagaobrikisil)


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1179 at r5 (raw file):

            new MemoryText("Baruch", new[] { TextRow("Baruch", 5, "target segment 5 .") }),
            new MemoryText("1Maccabees", new[] { TextRow("1Maccabees", 6, "target segment 6 .") }),
            new MemoryText("2Maccabees", new[] { TextRow("2Maccabees", 7, "target segment 7 .") })

I don't believe that this is testing what you think it's testing since you're just passing a number as the row ref, not the actual refs that would exist for these books' rows.


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1193 at r5 (raw file):

    [Test]
    public void GetRows_AllDeuterocanonicalBooks_WithAlignments()

If you have this test, I don't think you really need the previous one at all.


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1332 at r5 (raw file):

            AllTargetRows = true
        };
        ParallelTextRow[] rows = parallelCorpus.ToArray();

You can just iterate through the parallelCorpus - no need to call ToArray here, I don't think.


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1414 at r5 (raw file):

    [TestCase("SUS", "SUS 1:1", Description = "Validate SUS Source and Target References")]
    [TestCase("BEL", "BEL 1:1", Description = "Validate BEL Source and Target References")]
    public void ValidateSourceAndTargetReferencesForDeuterocanonicals(string bookId, string verseRef)

It's conceivable that this could get messed up during alignment but more so in the case when either the target or source is missing. I think you could remove this test and put more specific asserts in the previous test to better cover this.


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1499 at r5 (raw file):

        }

        if (issues.Any())

Like above, this test should fail if there's an issue. You should have specific asserts.


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1741 at r5 (raw file):

        }

        Assert.Pass("Test completed");

Same as above. Have specific asserts, not just a report. I think there was some miscommunication about this, so no worries :). It should make things a lot simpler too!


tests/SIL.Machine.Tests/Corpora/CorporaTestHelpers.cs line 141 at r5 (raw file):

    }

    public static (string Book, int Chapter, int StartVerse, int EndVerse, bool IsSingleVerse) ParseRange(string range)

This can be made private


tests/SIL.Machine.Tests/Corpora/CorporaTestHelpers.cs line 162 at r5 (raw file):

    /// Removes unwanted characters in a corpus string.
    /// </summary>
    public static string CleanString(string input, string[] unwanted)

Remind me why this is needed? You only use this once to remove the division symbol, but since you're removing it from both source and target, couldn't you just as well leave it in and the comparison would still be the same?


tests/SIL.Machine.Tests/Corpora/CorporaTestHelpers.cs line 174 at r5 (raw file):

    /// Replace multiple spaces with a single space.
    /// </summary>
    public static string NormalizeSpaces(string input)

Same as above - not sure this is really needed although I like the idea of cleaning up the data. Also, even if you did want to clean the data, why not just normalize the data in the files instead of renormalizing it each time you test it?

Copy link
Collaborator Author

@mudiagaobrikisil mudiagaobrikisil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 30 of 35 files reviewed, 11 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


tests/SIL.Machine.Tests/Corpora/CorporaTestHelpers.cs line 141 at r5 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

This can be made private

Done


tests/SIL.Machine.Tests/Corpora/CorporaTestHelpers.cs line 162 at r5 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Remind me why this is needed? You only use this once to remove the division symbol, but since you're removing it from both source and target, couldn't you just as well leave it in and the comparison would still be the same?

I removed it in the source in case it appeared in the source (just playing safe), but in actual sense, it appeared only in the target. That's why I couldn't leave it in, as that would mean the comparison would not be the same


tests/SIL.Machine.Tests/Corpora/CorporaTestHelpers.cs line 174 at r5 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Same as above - not sure this is really needed although I like the idea of cleaning up the data. Also, even if you did want to clean the data, why not just normalize the data in the files instead of renormalizing it each time you test it?

I didn't know if it was right for me to do that in the SFM files, which was why I just handled it in the code. What I think would be better is making the NormalizeSpaces part of the CleanString since the normalization occurs the same places I needed to clean the string. What do you think @Enkidu93 ?


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1388 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Thanks for the updates Mudi! As I look at them, I notice that you are just counting the number of mismatches, not verifying each proper match or mismatch. Can you verify each one in a way that makes it clear what each match or mismatch is intending to verify (either self-documented or with comments as needed)?

Made some changes


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1179 at r5 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I don't believe that this is testing what you think it's testing since you're just passing a number as the row ref, not the actual refs that would exist for these books' rows.

This test has been removed


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1193 at r5 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

If you have this test, I don't think you really need the previous one at all.

I agree.


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1332 at r5 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

You can just iterate through the parallelCorpus - no need to call ToArray here, I don't think.

You are correct. Thanks


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1414 at r5 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

It's conceivable that this could get messed up during alignment but more so in the case when either the target or source is missing. I think you could remove this test and put more specific asserts in the previous test to better cover this.

I have removed this and I am refactoring the above test to better cover this


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1499 at r5 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Like above, this test should fail if there's an issue. You should have specific asserts.

Done


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1741 at r5 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Same as above. Have specific asserts, not just a report. I think there was some miscommunication about this, so no worries :). It should make things a lot simpler too!

Done

@johnml1135
Copy link
Collaborator

Thanks Eli for doing the review. Issues 1-5 I am satisfied with your response on. For Issue 6 (mapping onto existing verses), I would like at least one test to verify the actual behavior (whatever it is) and to make sure that it does not crash. An example test is in the doc for Option B - Gen 1:3-4 mapping onto Gen 1:1-2.

@johnml1135
Copy link
Collaborator

tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1388 at r1 (raw file):

Previously, mudiagaobrikisil wrote…

Made some changes

Thank you for the updates, but I would really like the mismatches spelled out that are expected, not just counted.

Copy link
Collaborator Author

@mudiagaobrikisil mudiagaobrikisil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi John. Thanks. I will add a test for that.

Reviewable status: 28 of 35 files reviewed, 11 unresolved discussions (waiting on @Enkidu93)

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: 28 of 35 files reviewed, 7 unresolved discussions (waiting on @mudiagaobrikisil)


tests/SIL.Machine.Tests/Corpora/CorporaTestHelpers.cs line 162 at r5 (raw file):

Previously, mudiagaobrikisil wrote…

I removed it in the source in case it appeared in the source (just playing safe), but in actual sense, it appeared only in the target. That's why I couldn't leave it in, as that would mean the comparison would not be the same

See comment below.


tests/SIL.Machine.Tests/Corpora/CorporaTestHelpers.cs line 174 at r5 (raw file):

Previously, mudiagaobrikisil wrote…

I didn't know if it was right for me to do that in the SFM files, which was why I just handled it in the code. What I think would be better is making the NormalizeSpaces part of the CleanString since the normalization occurs the same places I needed to clean the string. What do you think @Enkidu93 ?

I think it's reasonable to combine those two functions. I'm still not convinced that removing the division signs needs to happen at all. If the source and target in the test are the same corpus, then they will both have the division symbols, so there's no reason to remove them. If they're different, then even after you remove the division symbols, they'll still be different. What am I missing?


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1179 at r5 (raw file):

Previously, mudiagaobrikisil wrote…

This test has been removed

Thank you! Great.


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1414 at r5 (raw file):

Previously, mudiagaobrikisil wrote…

I have removed this and I am refactoring the above test to better cover this

Sounds good.


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1741 at r5 (raw file):

Previously, mudiagaobrikisil wrote…

Done

Like John was saying above, specific asserts about the mappings is what we want

Copy link
Collaborator Author

@mudiagaobrikisil mudiagaobrikisil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 28 of 35 files reviewed, 7 unresolved discussions (waiting on @Enkidu93)


tests/SIL.Machine.Tests/Corpora/CorporaTestHelpers.cs line 174 at r5 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I think it's reasonable to combine those two functions. I'm still not convinced that removing the division signs needs to happen at all. If the source and target in the test are the same corpus, then they will both have the division symbols, so there's no reason to remove them. If they're different, then even after you remove the division symbols, they'll still be different. What am I missing?

Yes you are correct. The issue was from the initial source and target files Peter sent. Removing it made the test fail as only the target had the division sign. For the new files he sent, just tested and there was no need to remove the signs.

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 28 of 35 files reviewed, 7 unresolved discussions (waiting on @mudiagaobrikisil)


tests/SIL.Machine.Tests/Corpora/CorporaTestHelpers.cs line 174 at r5 (raw file):

Previously, mudiagaobrikisil wrote…

Yes you are correct. The issue was from the initial source and target files Peter sent. Removing it made the test fail as only the target had the division sign. For the new files he sent, just tested and there was no need to remove the signs.

Great! Thank you - sounds good.

Copy link
Collaborator Author

@mudiagaobrikisil mudiagaobrikisil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 28 of 35 files reviewed, 7 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1388 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Thank you for the updates, but I would really like the mismatches spelled out that are expected, not just counted.

When you said spelled out, what do you mean

Copy link
Collaborator

@Enkidu93 Enkidu93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 28 of 35 files reviewed, 7 unresolved discussions (waiting on @johnml1135 and @mudiagaobrikisil)


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1388 at r1 (raw file):

Previously, mudiagaobrikisil wrote…

When you said spelled out, what do you mean

Add asserts that describe the specific 'mismatches' - i.e., as you iterate or just via indexing, add asserts that describe specific expectations about the source and target in that row.

Copy link
Collaborator Author

@mudiagaobrikisil mudiagaobrikisil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 28 of 35 files reviewed, 7 unresolved discussions (waiting on @Enkidu93 and @johnml1135)


tests/SIL.Machine.Tests/Corpora/ParallelTextCorpusTests.cs line 1388 at r1 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Add asserts that describe the specific 'mismatches' - i.e., as you iterate or just via indexing, add asserts that describe specific expectations about the source and target in that row.

Okay. Understood.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants