Tools=>Illustration Fixup #595

Closed
okrick opened this issue Dec 23, 2024 · 6 comments · Fixed by #644
Labels
core feature Required for basic PPing

Comments

@okrick

okrick commented Dec 23, 2024

The current GG2 Illustration Fixup tool has limitations:

Limited Detection of Mid-Paragraph Illustrations: It only reliably identifies mid-paragraph Illustrations if they are explicitly marked with an asterisk before the Illustration tag (e.g., *[Illustration...).

Failure to Detect Page Break Interruptions: The tool fails to recognize instances where a paragraph is interrupted by a page break, followed by an Illustration, and potentially more page breaks before the paragraph resumes.

Dependence on Manual Asterisk Placement: Proofreaders often omit the necessary asterisk when the Illustration occupies an entire page, making these instances difficult for the tool to detect.

Addressing these limitations would require enhancements to the GG2 Illustration Fixup tool:

Improved Contextual Analysis: The tool could be enhanced to analyze paragraph flow across page breaks, considering the presence of Illustrations as potential interruptions.

Current Workaround:

To manually identify potential paragraph interruptions caused by Illustrations, I currently use the following search term: ^-+[^\n+]+-\n+\*?\[(Illustration|Music)

This search term helps locate images that might be breaking paragraphs by targeting images at the top of pages.
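As a rough illustration, the search term above can be tried outside Guiguts with Python's `re` module. This is only a sketch: the function name and sample text are hypothetical, and it assumes the `re.MULTILINE` flag is needed so that `^` anchors at the start of each line, which may differ from how the Guiguts search dialog treats the pattern.

```python
import re

# The search term from the comment above, compiled with MULTILINE so that
# ^ anchors at the start of every line rather than only at the start of the file.
TOP_OF_PAGE_ILLO = re.compile(
    r"^-+[^\n+]+-\n+\*?\[(Illustration|Music)", re.MULTILINE
)

# Hypothetical sample text, modelled on the page separators used in this thread.
SAMPLE = (
    "last line of text on the previous page\n"
    "-----File: 024.png-----\n"
    "\n"
    "[Illustration]\n"
    "\n"
    "More text, followed later by an illo that is not at the top of a page.\n"
    "\n"
    "[Illustration: A caption]\n"
)


def find_top_of_page_illos(text):
    """Return (line_number, markup_type) for each illo/music block that
    directly follows a page separator line (line numbers are 1-based)."""
    hits = []
    for m in TOP_OF_PAGE_ILLO.finditer(text):
        # Report the line of the "[" that opens the markup, not the separator.
        bracket_pos = m.start(1) - 1
        line_no = text.count("\n", 0, bracket_pos) + 1
        hits.append((line_no, m.group(1)))
    return hits


print(find_top_of_page_illos(SAMPLE))  # prints [(4, 'Illustration')]
```

Only the illo on line 4 is reported. The one on line 8 is skipped because it is not directly below a page separator, which matches the stated limitation: the pattern finds illos at the top of a page, not mid-paragraph illos in general.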

Note: I realize a complete solution might not be feasible. I only ask that the problem be given some consideration.

henry.txt
henry.txt.json

@windymilla
Collaborator

Thanks for the suggestions, Rick.

Notes (partly for whoever looks at this)

  1. Rick's regex relies on the "----- File:... -----" page break lines still being in the file at the time you use it.
  2. I believe it just finds illos at the top of a page, but doesn't detect whether they are mid-paragraph.
  3. There are some cases it's not possible to detect, e.g.
this is the final line of a paragraph (or is it mid-paragraph) - who can tell?
-----File: 024.png---------------------------------------------------------

[Illustration]

This is the next line of text, but is it a new paragraph, or is the blank line
above to set it off from the illo? 
-----File: 025.png---------------------------------------------------------

@okrick
Author

okrick commented Dec 24, 2024

It may never be possible to catch them all, but catching more would be helpful. Relying on the asterisk alone misses far too many and may leave the PPer surprised later. I was astonished when only one was found in the entire file (already corrected before I zipped it).

Perhaps the better solution may be to provide a manual tool similar to the GG1 Tools=>Character Tools=>Search for Transliterations or the GG2 Tools=>Stealth Scannos tools. And skip identifying the asterisks in the illustration check--there are several other checks for asterisks in the menus, e.g. Search=>Find Asterisk w/o Slash.

@windymilla
Copy link
Collaborator

I composed a reply, but obviously never clicked the final "Comment" button to post it.
My suggestion is that we improve the checking to catch the following case:

this is the final line of a paragraph (or is it mid-paragraph) - who can tell?
-----File: 024.png---------------------------------------------------------

[Illustration]
-----File: 025.png---------------------------------------------------------

-----File: 026.png---------------------------------------------------------
This is the next line of text, but is it a new paragraph, or is the blank line
above to set it off from the illo? 

So the algorithm would be:

  1. Look forward from the Illo markup, skipping blank lines and "-----File" lines until we find the next bit of "real" text.
  2. If there is not a blank line immediately before the "real" text, the illo is mid-paragraph.

I think this would catch a lot of the cases where the formatter hasn't marked it as a mid-para illo, and shouldn't produce false positives.
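The two steps above might be sketched in Python roughly as follows. This is not the Guiguts implementation: the function names are hypothetical, and it assumes the file is held as a list of lines and that page separators begin with `-----File:` as in the examples in this thread.

```python
def is_page_separator(line):
    """True for "-----File: 024.png-----" style page break lines."""
    return line.startswith("-----File:")


def illo_is_mid_paragraph(lines, illo_end):
    """Apply the two steps above.

    lines    -- the file as a list of lines without trailing newlines
    illo_end -- index of the last line of the illo markup
    """
    # Step 1: look forward, skipping blank lines and page separators.
    i = illo_end + 1
    while i < len(lines) and (
        not lines[i].strip() or is_page_separator(lines[i])
    ):
        i += 1
    if i == len(lines):
        return False  # no text follows, so there is no paragraph to interrupt
    # Step 2: a blank line immediately before the text marks a new paragraph;
    # anything else means the illo sits in the middle of one.
    return lines[i - 1].strip() != ""


# Hypothetical sample mirroring the case described above.
example = [
    "this is the final line of a paragraph (or is it mid-paragraph)",
    "-----File: 024.png-----",
    "",
    "[Illustration]",
    "-----File: 025.png-----",
    "",
    "-----File: 026.png-----",
    "This is the next line of text",
]
print(illo_is_mid_paragraph(example, 3))  # prints True
```

Here the line directly above the resumed text is a page separator rather than a blank line, so the illo is reported as mid-paragraph. In the earlier ambiguous example, where a blank line precedes the following text, the same check returns `False`.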

@okrick
Author

okrick commented Dec 26, 2024

I concur. While it might miss a few instances, this approach should significantly improve the situation.

One potential limitation is the program's ability to accurately identify instances where multiple illustrations occur consecutively. I'll leave the decision regarding the feasibility of implementing a longer lookahead mechanism to your discretion.

this is the final line of a paragraph (or is it mid-paragraph) - who can tell?
-----File: 024.png---------------------------------------------------------

[Illustration]
-----File: 025.png---------------------------------------------------------

-----File: 026.png---------------------------------------------------------

[Illustration]
-----File: 027.png---------------------------------------------------------

-----File: 028.png---------------------------------------------------------
This is the next line of text, but is it a new paragraph, or is the blank line

windymilla added the "core feature" label (Required for basic PPing) on Dec 27, 2024
@windymilla
Collaborator

@okrick - I've changed the code to cope with the above situation of multiple illos & blank lines, and also with illos that span more than one line, and even ones that have blank lines within the caption, so in the next release you would get the following illos all reported as being MID-PARAGRAPH:


At the end of the Yser battle, after the 29^{th} of
October 1914, Oud-Stuyvekenskerke was only occupied for
-----File: 031.png---------------------------------------------------------

[Illustration: <sc>Dixmude.</sc>--Aerial photo (Mai 26^{th} 1917).]
-----File: 032.png---------------------------------------------------------

[Illustration: <sc>Dixmude.</sc>--Their Majesties King and Queen at the "Death trench".
(June 1^{st} 1917).]

[Illustration: <sc>Dixmude.</sc>--Their Majesties King and Queen at the Riderswork.
(June 1^{st} 1917).

/#
The Queen examining private J. Vermeire's helmet,
which had just been pierced by a German bullet.
#/
]
-----File: 033.png---------------------------------------------------------
a few days by weak German detachements, whilst our line
of defence had been brought back upon the Nieuport-Dixmude
railway line and rejoining the Yser at the

windymilla added a commit to windymilla/guiguts-py that referenced this issue Jan 5, 2025
It now looks forward to find the first "normal" line, i.e.
not an empty line, a `[Blank Page]`, another illo/SN, nor
a page separator line. When it finds a normal line, it checks
if the line above it is blank, meaning it's the start of a
paragraph. If not, then the illo/SN is mid-paragraph.

Fixes DistributedProofreaders#595
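
The look-ahead described in the commit message could be sketched along these lines. This is an illustration only, not the code from the commit: all names are hypothetical, and it assumes that illo/SN markup starts at the beginning of a line with `[Illustration` or `[Sidenote` (optionally preceded by `*`) and that the end of a block can be found by tracking square-bracket depth.

```python
import re

# Assumed form of the start of an illo/SN block (not taken from Guiguts).
BLOCK_START = re.compile(r"\*?\[(Illustration|Sidenote)")


def block_end(lines, start):
    """Index of the line closing the [ ... ] block opened on lines[start].
    Bracket depth is tracked so captions may span several lines and may
    contain blank lines."""
    depth = 0
    for i in range(start, len(lines)):
        depth += lines[i].count("[") - lines[i].count("]")
        if depth <= 0:
            return i
    return len(lines) - 1  # unclosed markup: treat the rest as the block


def mid_paragraph_blocks(lines):
    """Return start indexes of illo/SN blocks whose next "normal" line
    is not preceded by a blank line."""
    found = []
    for start, line in enumerate(lines):
        if not BLOCK_START.match(line):
            continue
        i = block_end(lines, start) + 1
        while i < len(lines):
            text = lines[i].strip()
            if BLOCK_START.match(lines[i]):
                i = block_end(lines, i) + 1  # skip a whole following illo/SN
            elif (not text or text == "[Blank Page]"
                  or lines[i].startswith("-----File:")):
                i += 1
            else:
                break  # first "normal" line found
        if i < len(lines) and lines[i - 1].strip():
            found.append(start)
    return found


# Hypothetical, shortened version of the example in the comment above.
sample = [
    "October 1914, Oud-Stuyvekenskerke was only occupied for",
    "-----File: 031.png-----",
    "",
    "[Illustration: Aerial photo.]",
    "-----File: 032.png-----",
    "",
    "[Illustration: Two-line caption",
    "(June 1917).",
    "",
    "/#",
    "Quoted text inside the caption.",
    "#/",
    "]",
    "-----File: 033.png-----",
    "a few days by weak German detachments",
]
print(mid_paragraph_blocks(sample))  # prints [3, 6]
```

Both illos are reported because the first normal line after each of them (index 14) sits directly below a page separator rather than a blank line.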
@okrick
Author

okrick commented Jan 5, 2025

Wow, that's quite an accomplishment.

Thanks
