Right-to-left and left-to-right orientation of Persian digital copies in the full-text display of the IIIF Viewer, DFG Viewer and PDF #54

Open
MaidaButtar opened this issue Nov 9, 2023 · 8 comments
Labels
enhancement New feature or request

@MaidaButtar

The following problems occur with left-to-right and right-to-left text orientation for Persian material in the full-text display of the IIIF Viewer and the DFG Viewer, and in the PDF files:

IIIF Viewer: The order of the words is correct, but the letters within each word are reversed. The order of digits is correct (reason: numbers are read from left to right).

DFG Viewer: All line breaks are gone. The order of the words is partly correct, but again the letters within each word are reversed. The order of digits is correct.

PDF: The order of the words is reversed at line level, but the letters within each word are not reversed. The order of digits is correct.
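
To make the three symptoms concrete, here is a minimal Python sketch that applies them to one sample line held in logical (reading) order. The sample text and both helper names are illustrative assumptions; nothing here is taken from the viewers or from the PDF tool.

```python
def reverse_letters_per_word(line: str) -> str:
    """Keep the word order but reverse the letters inside each word.

    Digit-only tokens are left untouched, mirroring the observation
    that digits still appear in the correct order.
    """
    return " ".join(w if w.isdigit() else w[::-1] for w in line.split(" "))


def reverse_word_order(line: str) -> str:
    """Keep each word intact but reverse the order of the words in the line."""
    return " ".join(reversed(line.split(" ")))


# Hypothetical sample line in logical order: two Persian words and a number.
logical = "نام کتاب 1402"

letters_reversed = reverse_letters_per_word(logical)  # as reported for the viewers
words_reversed = reverse_word_order(logical)          # as first reported for the PDF
both_reversed = reverse_word_order(letters_reversed)  # letters and words reversed

for label, text in [
    ("logical", logical),
    ("letters reversed", letters_reversed),
    ("words reversed", words_reversed),
    ("both reversed", both_reversed),
]:
    # Print code points too, because bidi-aware rendering can hide the stored order.
    print(label, [f"U+{ord(c):04X}" for c in text])
```

Printing the code points rather than the strings avoids being misled by the terminal or editor, which reorders right-to-left text for display.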

[Screenshots: DFG Viewer, IIIF Viewer, PDF]

@MaidaButtar
Author

Examples taken from: https://opendata.uni-halle.de/handle/1981185920/88120

@M3ssman
Member

M3ssman commented Nov 10, 2023

Unfortunately, the problem originates in the data itself, which already contains the letters in reversed order.
The workflow problem is related to ulb-sachsen-anhalt/ocrd-odem#14, but hopefully that is resolved by now.

Still, we have to deal with the PDF text layer.

@M3ssman
Member

M3ssman commented Nov 15, 2023

The word-level representation in the online viewers is out of scope for this tool.
Since the current OCR run using ODEM is going to produce properly ordered characters, this will be fixed as soon as possible.

@MaidaButtar
Can you please take another look at the PDF files?

It seems to me that the characters rendered in the navigation outline for sections and chapters (usually displayed on the left side of a PDF viewer such as the one in Firefox) are properly ordered.

@MaidaButtar
Author

I checked the PDF files, and now both known problems occur together: not only is the order of the letters within each word inverted, but so is the order of the words. In other words, the first word is at the end of the line.

And yes, the outline of sections and chapters on the left side of the PDF viewer is ordered properly.

@M3ssman
Member

M3ssman commented Nov 16, 2023

@MaidaButtar Can you please try these cases and report the results:

  • What happens if you search the PDF for a word that is displayed properly in the navigation outline (say, a word that is part of a chapter heading)?
  • What happens if you search the PDF for a different word that should be part of the text layer of a specific page, where characters are stored in inverted order?

And which PDF reader exactly are you using?

@MaidaButtar
Author

@M3ssman
I tested both Adobe Acrobat Reader and the PDF viewer in Firefox.

  1. When you search for a word that is displayed correctly in the navigation outline, there are two cases:
  • The search term is shown in the wrong order. The word is not found or highlighted in the headings; instead, the reversed variant is found and highlighted in the text. However, the highlight does not land on the reversed word itself but on some other word. On closer inspection, the reversed word is on the same line.
  • Nothing is found, as if there were no match.

  2. In the second case (a word from the page text), the words are displayed reversed. Again, the highlight is not on the exact word but on some other word. On closer inspection, the word is on the same line, only reversed.

@MaidaButtar
Author

Example:
Searching for the word نام shows it reversed, so مان is what appears in the full text.
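
One way to read this observation is that the text layer stores the word in reversed code-point order, so a search can only match the reversed variant. A minimal Python sketch of that check follows; `match_kind` is a hypothetical helper, and plain substring matching is a simplification of what a PDF viewer's search actually does.

```python
from typing import Optional


def match_kind(term: str, text_layer: str) -> Optional[str]:
    """Report how `term` occurs in `text_layer`.

    Returns "logical" if the term is stored as typed, "reversed" if only
    its character-reversed variant is stored, and None if neither occurs.
    """
    if term in text_layer:
        return "logical"
    if term[::-1] in text_layer:
        return "reversed"
    return None


# Search term from the example above, written as code points to avoid
# any ambiguity caused by bidi rendering: NOON, ALEF, MEEM.
term = "\u0646\u0627\u0645"

# Hypothetical text layer in which every word is stored letter-reversed.
text_layer = "\u0645\u0627\u0646 \u0628\u0627\u062a\u06a9"

print(match_kind(term, text_layer))
```

Note that a word reading the same in both directions would be reported as "logical", so this check only helps for non-palindromic words.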

M3ssman pinned this issue Nov 29, 2023
@M3ssman
Member

M3ssman commented Jan 2, 2024

To give an update:

  • A test has been integrated. Unfortunately, the order of the extracted characters does not equal the byte order in the text of the PDF data.
  • The library used for PDF processing, iText5, provides no means to manipulate the text direction. It is called run direction there, but it isn't actually used.

Therefore I'm afraid this issue is tied to the overall update of PDF generation (Next Version PDF Processing).
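
Because bidi-aware rendering can mask the stored order, a check like the test mentioned above is easier to reason about at the code-point level. The following Python sketch is not the project's actual test and is independent of iText5. How the text gets extracted from the PDF is left out, and `describe` and `first_difference` are hypothetical names.

```python
import unicodedata
from typing import List, Optional, Tuple


def describe(text: str) -> List[Tuple[str, str, str]]:
    """List each character as (code point, Unicode bidi class, name)."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


def first_difference(expected: str, extracted: str) -> Optional[int]:
    """Return the index of the first differing character, or None if equal."""
    for i, (a, b) in enumerate(zip(expected, extracted)):
        if a != b:
            return i
    if len(expected) != len(extracted):
        return min(len(expected), len(extracted))
    return None


expected = "\u0646\u0627\u0645"   # logical order: NOON, ALEF, MEEM
extracted = "\u0645\u0627\u0646"  # assumed extraction result: MEEM, ALEF, NOON

for entry in describe(extracted):
    print(entry)
print("first difference at index:", first_difference(expected, extracted))
```

Arabic-script letters carry the bidi class `AL`, while European digits carry `EN`, which is consistent with digits keeping their left-to-right order in the reports above.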

M3ssman added this to the 2.x.x milestone Nov 1, 2024
M3ssman added the enhancement label Nov 1, 2024