right to left and left to right Orientation of the Persian digital copies in the full-text display on IIIF, DFG Viewer and PDF #54

MaidaButtar · 2023-11-09T11:25:51Z

The following problems occur when recognizing and displaying the left-to-right and right-to-left orientation in the full-text display of the IIIF and DFG Viewer and in the PDF files in Persian:

IIIF Viewer: The order of the words is correct, but the letters in the words are reversed. The order of digits is correct (reason: numbers are read from left to right)

DFG Viewer: Line breaks are all gone, the order of words is halfway correct, but again, the letters in the word are reversed. The order of digits is correct.

PDF: Order of words reversed at line level, but letters in the word are not reversed. Order of digits correct.
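The symptoms listed above can be reproduced on a short string. The following is only a minimal sketch, assuming whitespace-separated words; the helper names and the sample line are made up for illustration and are not taken from the affected records or from any of the viewers.

```python
def reverse_letters_per_word(line: str) -> str:
    """Keep the word order, reverse the letters inside each non-numeric word."""
    return " ".join(w if w.isdigit() else w[::-1] for w in line.split())


def reverse_word_order(line: str) -> str:
    """Keep each word intact, reverse the order of the words on the line."""
    return " ".join(reversed(line.split()))


# Hypothetical sample line, stored in logical (reading) order.
logical = "نام کتاب 1402"

# Words in order, letters reversed, digits untouched (as described for IIIF).
iiif_like = reverse_letters_per_word(logical)

# Words reversed at line level, letters intact (as described for the PDF).
pdf_like = reverse_word_order(logical)
```

Comparing `iiif_like` and `pdf_like` with `logical` shows that the two effects are independent of each other, which is why they can also occur together.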

MaidaButtar · 2023-11-09T11:31:58Z

Examples taken from: https://opendata.uni-halle.de/handle/1981185920/88120

M3ssman · 2023-11-10T06:11:51Z

Unfortunately the problem originates from the data itself, which contains the letters already in reversed order.
The workflow problem is related to ulb-sachsen-anhalt/ocrd-odem#14, but hopefully this is gone by now.

Still, we must face the PDF text layer.
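To illustrate what "letters already in reversed order" means for the data, here is a rough Python sketch that flips such tokens back. It is an assumption-laden illustration, not the fix applied in ODEM: `restore_logical_order` and `has_rtl_letters` are hypothetical names, and the approach only handles tokens made up purely of letters or purely of digits.

```python
import unicodedata

# Strong right-to-left bidi classes: Hebrew-like (R) and Arabic-like (AL).
RTL_CLASSES = {"R", "AL"}


def has_rtl_letters(token: str) -> bool:
    """True if the token contains at least one strong right-to-left character."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in token)


def restore_logical_order(line: str) -> str:
    """Reverse tokens containing right-to-left letters; leave all others as-is."""
    return " ".join(t[::-1] if has_rtl_letters(t) else t for t in line.split())
```

Digits carry a different bidi class than Persian letters, so a token such as `1402` is left alone, which matches the observation that digit order was already correct. Tokens that mix letters with digits or punctuation would need more careful handling than this sketch provides.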

M3ssman · 2023-11-15T06:47:32Z

The word-level representation in the online viewers is out of scope for this tool.
Since the current OCR run using ODEM is going to produce properly ordered characters, this will be fixed as soon as possible.

@MaidaButtar
Can you please take a look into the PDF files again?

It seems to me that the rendered characters in the outline to navigate between sections / chapters (usually displayed at the left part of a PDF-viewer, like Firefox Browser) are properly ordered.

MaidaButtar · 2023-11-15T09:06:01Z

I checked the PDF files, and both known cases now occur: not only is the order of the letters within a word inverted, but so is the order of the words. In other words, the first word is at the end of the line.

And that is correct: the outline of sections and chapters on the left of the PDF viewer is ordered properly.

M3ssman · 2023-11-16T06:37:46Z

@MaidaButtar Can you please try these cases and report their results:

What happens if you search the PDF for a word that is displayed properly in the navigation outline (say, a word that is part of a chapter heading)?
What happens if you search the PDF for a different word that should be part of the text layer of a specific page, in character-inverted notation?

And exactly which PDF-reader tool are you using?

MaidaButtar · 2023-11-16T09:47:39Z

@M3ssman
I tested both Adobe Acrobat Reader and the PDF viewer in the Firefox browser.

When you search for a word that is displayed correctly in the navigation, there are two cases:

- The search leads to the word being displayed in the wrong order. The searched word is not found or marked in the headings; instead, the reversed variant is found and marked in the text. However, the highlight is not on the reversed word itself but on some other word. If you look closely, you can see that the reversed word is on the same line.
- Nothing is found, as if there were no match.

In this case, the words are displayed reversed. Again, it is not the exact word that is marked but some other word; if you look closely, you can see that the searched word is on the same line, only reversed.

MaidaButtar · 2023-11-16T09:50:47Z

example:
The word نام is searched for and is displayed reversed; that is, مان appears in the full text.
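The search behaviour in this example can be checked mechanically once the text layer of a page has been extracted. A minimal sketch, assuming the extracted text is already available as a plain Python string; `find_variants` is a hypothetical helper, not part of any PDF reader or library.

```python
def find_variants(text_layer: str, word: str) -> dict:
    """Report whether the word or its letter-reversed form occurs in the text."""
    return {
        "logical": word in text_layer,
        "reversed": word[::-1] in text_layer,
    }
```

If only `"reversed"` is true for a word that is spelled correctly in the navigation outline, the page text stores the letters in inverted order while the outline does not.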

M3ssman · 2024-01-02T08:23:45Z

To give an update:

- A test has been integrated. Unfortunately, the order of the extracted characters does not equal the byte order of the text in the PDF data.
- The library used for PDF processing, iText5, provides no means to manipulate text direction; the corresponding setting is called run direction there, but it isn't actually used.

Therefore I'm afraid this issue is tied to the overall update of PDF generation (Next Version PDF Processing).

M3ssman pinned this issue Nov 29, 2023

M3ssman added this to the 2.x.x milestone Nov 1, 2024

M3ssman added the enhancement New feature or request label Nov 1, 2024
