Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not translate markers #281

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open

Conversation

johnml1135
Copy link
Collaborator

@johnml1135 johnml1135 commented Jan 17, 2025

This attempts to address the following issues:

It also reworks the Note handling, making it so that it preserves the reference fields.
It also combines Notes, Figures and Cross References as "SubComponents"

There are still a few tests that need fixed.

There are significant parsing updates - and the tests were updated to match the updated behavior. @ddaspit - can you check to see if they are going in the right direction?


This change is Reviewable

@johnml1135 johnml1135 requested a review from ddaspit January 17, 2025 22:03
@johnml1135 johnml1135 force-pushed the do_not_translate_markers branch from 034e8bc to 03adec2 Compare January 18, 2025 01:03
Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the USFM parsing classes are based on the implementations used in Paratext. As a result, I don't want to make any major changes to the way that these classes work. This ensures that Machine has the same level of compatibility as Paratext. It also makes it easier to port fixes over from Paratext. It would also introduce breaking changes to classes that other codebases use. It would be good if you could refactor this so that it builds on top of the USFM parsing classes rather than changing them.

Reviewable status: 0 of 17 files reviewed, all discussions resolved

@johnml1135
Copy link
Collaborator Author

I will have to think about how to implement this, but from what I am hearing, the "SubComponent" method of thinking and working is fine, but it should be implemented in the handlers, not the parsers directly.

If that is the case, which files directly mirror the Paratext files (the USFM Tokenizer and USFM parser)? Can I get access to those Paratext files? Should we put a comment in the files saying that they are intended to mirror that parsing?

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is correct. It should be implemented in a handler.

There are three sets of classes that are based off of Paratext:

  1. UsfmStylesheet, UsfmTag
  2. UsfmTokenizer, UsfmToken, UsfmAttribute
  3. UsfmParser, UsfmParserState, IUsfmParserHandler

Paratext 9 is closed-source, so it is hard to directly link them to the corresponding code in Paratext. We can put comments on these classes to indicate that they are implemented to mimic the behavior of Paratext, so change them with caution.

Reviewable status: 0 of 17 files reviewed, all discussions resolved

@johnml1135
Copy link
Collaborator Author

That sounds good - I will work on the changes that you are requesting, will add the comments to the top of the files and try to fix the last few bugs. Other than that, I will assume that the general direction for the code is good.

I may also try to add in sillsdev/serval#604 - making sure that tables are handled properly.

@johnml1135 johnml1135 force-pushed the do_not_translate_markers branch from 21864af to 20d0768 Compare January 23, 2025 21:46
Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you still working on this? I thought we decided to not change the behavior of the USFM parser classes.

Reviewed all commit messages.
Reviewable status: 0 of 17 files reviewed, all discussions resolved

@johnml1135
Copy link
Collaborator Author

Yes, I am refactoring it now. I just pushed some code to redo the naming first.

@johnml1135 johnml1135 force-pushed the do_not_translate_markers branch from 20d0768 to 1451ae2 Compare January 24, 2025 20:15
@codecov-commenter
Copy link

codecov-commenter commented Jan 24, 2025

Codecov Report

Attention: Patch coverage is 97.50000% with 4 lines in your changes missing coverage. Please review.

Project coverage is 70.33%. Comparing base (3a9b17e) to head (265f8b8).

Files with missing lines Patch % Lines
...chine/Corpora/ScriptureRefUsfmParserHandlerBase.cs 96.55% 1 Missing and 2 partials ⚠️
src/SIL.Machine/Corpora/UpdateUsfmParserHandler.cs 98.36% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #281      +/-   ##
==========================================
+ Coverage   70.28%   70.33%   +0.05%     
==========================================
  Files         385      385              
  Lines       32019    32121     +102     
  Branches     4504     4523      +19     
==========================================
+ Hits        22503    22592      +89     
- Misses       8471     8483      +12     
- Partials     1045     1046       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@johnml1135
Copy link
Collaborator Author

@ddaspit - I made the updates. All tests pass. Please make sure that the "intent" is correct.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't gone through everything yet, but it is looking good so far. How does this align with @isaac091's work on preserving USFM markers? Will we need to redo the work here when that comes in? It might make sense to wait for those changes to come in before we do this, that way we can reduce churn and ensure that we have the right design.

One thing to note is that the term is "embed" and not "embedded".

Reviewed 6 of 14 files at r3, 1 of 3 files at r4.
Reviewable status: 7 of 17 files reviewed, all discussions resolved

@johnml1135
Copy link
Collaborator Author

I talked with @isaac091 and the code looked fairly different - we will need to collaborate hand haven't done so fully yet. After this goes through the first review, I will do the parallel Machine.py updates and then we can merge in both at the same time.

@johnml1135 johnml1135 force-pushed the do_not_translate_markers branch from db5a843 to 0b82cf0 Compare January 30, 2025 16:34
Only translate ft text
Configure preserving / stripping embedded and style markers

Embed
@johnml1135 johnml1135 force-pushed the do_not_translate_markers branch from 0b82cf0 to 6108e67 Compare January 30, 2025 16:42
* Nested embeds
* If we strip embeds and there is an updated embed, strip it
All tests pass
@johnml1135 johnml1135 force-pushed the do_not_translate_markers branch from d08423a to ed41aef Compare February 11, 2025 13:37
@johnml1135
Copy link
Collaborator Author

@ddaspit - All machine and machine.py updates have been made - they match each other and all tests pass.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 17 files at r1, 1 of 14 files at r3, 14 of 14 files at r5, all commit messages.
Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @johnml1135)


src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 25 at r5 (raw file):

        private bool _inEmbed;
        public bool InNoteText { get; private set; }

This should be protected.


src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 176 at r5 (raw file):

        }

        public void StartEmbed(UsfmParserState state, string marker)

This method should be protected.


src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 191 at r5 (raw file):

        }

        public virtual void StartEmbed(UsfmParserState state, ScriptureRef scriptureRef) { }

This method should be protected.


src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 193 at r5 (raw file):

        public virtual void StartEmbed(UsfmParserState state, ScriptureRef scriptureRef) { }

        public virtual void EndEmbed(

This method should be protected.


src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 271 at r5 (raw file):

        protected virtual void EndNonVerseText(UsfmParserState state, ScriptureRef scriptureRef) { }

        public virtual void StartNoteTextWrapper(UsfmParserState state)

This method should be protected. Also, this method is named inconsistently with the corresponding EndNoteText method.


src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 280 at r5 (raw file):

        protected virtual void StartNoteText(UsfmParserState state) { }

        public virtual void EndNoteText(UsfmParserState state)

This method should be protected.


src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 384 at r5 (raw file):

        }

        public bool InEmbed(string marker)

This should be protected. Also, it should be named IsInEmbed.


src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 389 at r5 (raw file):

        }

        public bool IsInNestedEmbed(string marker)

This should be protected.


src/SIL.Machine/Corpora/UsfmStylesheet.cs line 114 at r3 (raw file):

        }

        private static IEnumerable<string> GetEmbeddedStylesheet(string fileName)

It looks like you accidentally renamed this method.


src/SIL.Machine/Corpora/UpdateUsfmParserHandler.cs line 14 at r5 (raw file):

    }

    public enum UpdateUsfmIntraVerseMarkerBehavior

This should be named UpdateUsfmMarkerBehavior.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 25 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This should be protected.

Done.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 176 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This method should be protected.

Done.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 191 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This method should be protected.

Done.

@johnml1135
Copy link
Collaborator Author

src/SIL.Machine/Corpora/ScriptureRefUsfmParserHandlerBase.cs line 193 at r5 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This method should be protected.

Done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants