NPE on getImagePath if LOCTYPE=URL #39
Comments
I (downloaded and) inserted secondary FLocat with |
No, even after removing the remote URLs completely (so only the local FLocats would remain for MAX and FULLTEXT), it crashes the same. |
Ok, so perhaps the tool expects the directory name to be MAX, not just the |
No, not even that works. How do you use this tool? |
Thank you for trying it out! Since our workflow does not use the OCR-D METS itself, I was not aware of these issues. Over the past 1.5 years it has served well more than 250k times to process digital objects pulled via OAI from Visual Library Server (versions ranging from 2012 to 2022.06) and opendata (DSpace 6); the latter will hopefully be OCR'd by OCR-D. METS like these are both DDB-valid and processable by Derivans, if the image content of the fileGroup Another scenario just uses a flat tree without any METS but at least a One can add lots of processing steps in a configuration (which by default is expected to be in |
I already tried that – see above. Let me elaborate. This is the filesystem:
The METS references these as local hrefs:
<mets:fileGrp USE="MAX">
<mets:file ID="FILE_0001_MAX" MIMETYPE="image/jpeg">
<mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="MAX/00000001.tif.large.jpg"/>
</mets:file>
<mets:file ID="FILE_0002_MAX" MIMETYPE="image/jpeg">
<mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="MAX/00000002.tif.large.jpg"/>
</mets:file>
...
<mets:file ID="FILE_0536_MAX" MIMETYPE="image/jpeg">
<mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="MAX/00000536.tif.large.jpg"/>
</mets:file>
</mets:fileGrp>
<mets:fileGrp USE="FULLTEXT">
<mets:file ID="FILE_0005_FULLTEXT" MIMETYPE="text/xml">
<mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="FULLTEXT/FILE_0005_FULLTEXT.xml"/>
</mets:file>
...
<mets:file ID="FILE_0533_FULLTEXT" MIMETYPE="text/xml">
<mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="FULLTEXT/FILE_0533_FULLTEXT.xml"/>
</mets:file>
</mets:fileGrp>
I am running digital-derivans like this:
What else am I expected to do to fit your profile? |
Running with just the directory name does produce a PDF file. It is 535 MB, and all pages look like grainy rainbows:
Also, I tried with non-OCR-D (straight out of Kitodo.Presentation / DFG-Viewer) already. Same problem! |
Ok, so how do I make this work?
|
Regarding your first attempt: The setup and call look quite reasonable, with the difference that in our workflows the OCR-File and the Image match exactly by name. But IIRC (https://github.com/ulb-sachsen-anhalt/digital-derivans/blob/master/src/main/java/de/ulb/digital/derivans/data/MetadataStore.java#L86), they are picked by the physical links in each physical sub division, which I did not recognize in your example. Regarding the second one: I did in a fresh venv with ocrd 2.49.0 |
you mean the base name, excluding the suffix? IMO that would be unrealistic and overly strict.
I did not show the physical structMap. It looks like this:
<mets:structMap TYPE="PHYSICAL">
<mets:div ID="PHYS_0000" TYPE="physSequence">
<mets:fptr FILEID="FULLDOWNLOAD"/>
<mets:div ID="PHYS_0001" ORDER="1" ORDERLABEL=" - " TYPE="page">
<mets:fptr FILEID="FILE_0001_THUMBS"/>
<mets:fptr FILEID="FILE_0001_DOWNLOAD"/>
<mets:fptr FILEID="FILE_0001_MIN"/>
<mets:fptr FILEID="FILE_0001_DEFAULT"/>
<mets:fptr FILEID="FILE_0001_MAX"/>
<mets:fptr FILEID="FILE_0001_ORIGINAL"/>
</mets:div>
<mets:div ID="PHYS_0002" ORDER="2" ORDERLABEL=" - " TYPE="page">
<mets:fptr FILEID="FILE_0002_THUMBS"/>
<mets:fptr FILEID="FILE_0002_DOWNLOAD"/>
<mets:fptr FILEID="FILE_0002_MIN"/>
<mets:fptr FILEID="FILE_0002_DEFAULT"/>
<mets:fptr FILEID="FILE_0002_MAX"/>
<mets:fptr FILEID="FILE_0002_ORIGINAL"/>
</mets:div>
...
So, judging by the code, I guess the How do you run with debug logging?
Like I said, I don't know what to expect. You said you have used this thousands of times on METS in your presentation. Presentation METS usually only have URLs. I already documented my odyssey of trying various combinations of remote and local hrefs above. If I do an additional …
… (which replaces URLs with local path refs), then it works (but without text layer). |
To clarify, the processing used by OCR-D-ODEM is based on a list of OAI record URNs and works as follows:
Back to usage: Probably I've tried to follow your way like this (with smaller print
This, in my case, creates a PDF with text layer, outline and metadata. Probably you can get more information with a sample config. One is located at In production environments it is ensured that the required configs are located in a subdirectory
|
It was the problem with my own dataset. After stripping the existing FULLDOWNLOAD fptr (together with all the other steps described above), derivans does process the METS. Unfortunately, I end up with the same broken result I get when just passing the directory: garbage rainbows without any text.
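Whether the FULLDOWNLOAD fptr failed because it pointed at a file entry that no longer existed is not stated in this thread, but that is one plausible way for an fptr to trip a tool. A minimal Python sketch, assuming that cause, lists every FILEID in the physical structMap that has no matching mets:file. The function name `dangling_fptrs` is made up for illustration and is not part of digital-derivans.

```python
import xml.etree.ElementTree as ET

METS = "{http://www.loc.gov/METS/}"


def dangling_fptrs(root):
    """Return FILEIDs used in the physical structMap that no mets:file declares."""
    # Collect every file ID declared anywhere in the document.
    file_ids = {f.get("ID") for f in root.iter(METS + "file")}
    missing = []
    for smap in root.iter(METS + "structMap"):
        if smap.get("TYPE") != "PHYSICAL":
            continue
        for fptr in smap.iter(METS + "fptr"):
            fid = fptr.get("FILEID")
            if fid not in file_ids:
                missing.append(fid)
    return missing
```

Calling `dangling_fptrs(ET.parse("mets.xml").getroot())` on a workspace copy (the file name is a placeholder) gives a list of candidates to strip or repair before running the tool.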
Thanks for that explanation. So you are not using the METS yourself, only the directory. What I still find missing is at what point you download the MAX images. Do you just copy them over from the OCR-D workspaces (together with the FULLTEXT files)?
In my case? No. See above.
Like I said, I don't get a text layer, regardless of what dataset (yours or mine). Perhaps I need some configuration file? So what setting there influences whether or not a text layer gets added? |
There's no additional setting required. If config (c.f. above) and logging are in place, please inspect the log for messages like:
This means that for page 4 no OCR data was present (nothing enriched) but starting with page 5 the metadata points to OCR files. In this test case it's OCR-D transformed ALTO At PDF creation time it is processed like this:
which indicate that recognition and parsing took place and that the OCR data had to be scaled to match the target image, which is scaled as well. This is because we scale images for the PDF to reduce size for reading on screen. Output like Maybe the OCR data isn't properly recognized? Is it possible to provide some test data for analysis? |
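The scaling step described here can be illustrated with a small Python sketch. It is not the digital-derivans implementation (which is Java); the `Box` type and `scale_box` function are hypothetical, and it assumes the OCR coordinates are pixel positions on the original image.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x: float
    y: float
    width: float
    height: float


def scale_box(box, ocr_size, image_size):
    """Map a box from OCR page coordinates onto a resized target image.

    ocr_size and image_size are (width, height) tuples.
    """
    sx = image_size[0] / ocr_size[0]  # horizontal scale factor
    sy = image_size[1] / ocr_size[1]  # vertical scale factor
    return Box(box.x * sx, box.y * sy, box.width * sx, box.height * sy)
```

If the image is halved for the PDF, every word box is halved with it, so the invisible text layer still lines up with the visible glyphs.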
Ok, we are getting there. I have copied Now I can see log messages. A new problem arose after the first successful run (with the garbled colours): when digital-derivans added the PDF to my METS, it created invalid identifiers! Looks like the file name and file ID are based on whatever
|
Also, digital-derivans converted my METS from LF to CRLF convention for EOL. It's debatable whether this is still correct, but it's unexpected. Another problem: the PDF gets referenced as fptr in the logical structMap. That's plain wrong according to DFG profile – it should be in the physical structMap. |
The problem is in IMO it should simply look for But even then – to use this directly as PDF file name and XML identifier for it is just wrong. It should at least convert colons to underscores. See |
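The colon-to-underscore conversion suggested above could look roughly like the following Python sketch. The function name `to_xml_id`, the `PDF_` prefix and the sample value are made up for illustration. The character set is a deliberately conservative ASCII subset of what xs:ID (an NCName) allows, not the full XML rule.

```python
import re


def to_xml_id(raw, prefix="PDF_"):
    """Turn an arbitrary identifier into a conservative xs:ID / file-name stem."""
    # Replace everything outside a safe ASCII subset, including colons and slashes.
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", raw)
    # An NCName must not start with a digit, dot or hyphen.
    if not re.match(r"[A-Za-z_]", cleaned):
        cleaned = prefix + cleaned
    return cleaned
```

With a hypothetical input such as `"urn:nbn:de:example-1234"` this yields `"urn_nbn_de_example-1234"`, which is usable both as a file name and as an ID attribute value.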
The decisions about which one to use for what purpose reflect our in-house workflows.
I've tried to do this in the README, but it seems to be unclear. Please add critical remarks and open a PR to help this out.
What DFG profile do you mean? This insertion is DDB-valid. Further, it is correctly displayed and linked in the DFG-Viewer. Try a digital object from Share_it or Share_DIGit; almost all of them are done this way. Also, enterprise components like Visual Library or Zeutschel do it like this. |
Oh sorry, I remember reading this now. It looked complicated... At least a reference to the example configs under
I meant the DFG profile for METS. But now that I went looking, surprisingly I cannot find any specifics for PDF in there, except for the mention of the dedicated It did enter the OCR-D spec on METS though. There it says to use fptr in the top-level div of the physical structMap. Looking at the code base for DFG Viewer, Kitodo.Presentation, it appears like both are supported: fptr under physical and fptr under logical. I am somewhat perplexed. How come this important detail never entered any official documentation?
Indeed.
Ok, so at least SLUB (which also uses Zeutschel for OCR) puts it in the physical structMap. But since both options are allowed, I guess we can just as well keep it as it is. Now, coming back to my problem with the generated images. This is what a page in MAX looks like: And this is what digital-derivans generates under They all look like this. ImageMagick complains about them like so:
There's nothing special in the logs. If you want to reproduce, here is the presentation without full text, and here is a version compatible with DFG Viewer which contains full text on selected pages. (You have to do the preprocessing as described above to get it working with digital-derivans.) So what could be causing these broken images? |
I'll have a look at this and report back. |
Okay, I could reproduce the effect with the OAI record data you provided. Have a look at the branch https://github.com/ulb-sachsen-anhalt/digital-derivans/tree/fix/jpg-render-baseline which contains my first guess at fixing this behavior. You'll need a locally installed OpenJDK 11 and Maven 3.6 to execute a |
Just found out that |
That worked! Now I can see correct JPEGs and I also get the text layer. One more problem: the global setting |
Great to hear! Can you please transfer the quality setting to a new issue? |
Sure, I'll spawn a new issue for each problem I found along the way. For some of them, I already have fixes. BTW, while compiling, I was surprised to see an exception with stacktrace – apparently, one of the METS files in the test set is not valid. Is that intentional? Instrumenting with a log message that shows the affected file name, and then validating against the METS schema externally, I found out this much:
xmllint --noout --schema ../mets.xsd src/test/resources/mets/vls/vd18-9989442.ulb.xml
src/test/resources/mets/vls/vd18-9989442.ulb.xml:323: element file: Schemas validity error : Element '{http://www.loc.gov/METS/}file', attribute 'ID': 'IMG_MAX_10000000' is not a valid value of the atomic type 'xs:ID'.
src/test/resources/mets/vls/vd18-9989442.ulb.xml:510: element div: Schemas validity error : Element '{http://www.loc.gov/METS/}div', attribute 'ID': 'phys10000000' is not a valid value of the atomic type 'xs:ID'.
src/test/resources/mets/vls/vd18-9989442.ulb.xml fails to validate
The reason seems to be that these identifiers appear multiple times. |
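To confirm the suspicion that identifiers are repeated, a short Python sketch can count the ID attributes directly. `duplicate_ids` is a hypothetical helper. It looks only at attributes literally named `ID`, which is how the METS schema declares its xs:ID attributes.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_ids(root):
    """Return the sorted ID attribute values that occur more than once."""
    counts = Counter(
        el.get("ID") for el in root.iter() if el.get("ID") is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)
```

Running `duplicate_ids(ET.parse(path).getroot())` on the file reported by xmllint should list the repeated values if duplication is indeed the cause.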
Further, regarding tests involved in IOW when I place my own config under |
No, you may create something like The configs from the build are just test sample configs and not meant to be used in production scenarios. Don't worry about the error messages during the build; this is intended behavior. It is better to know how an application deals with unknown or corrupt data, since such data appears quite commonly in massive workflows mixed with legacy material from the past 20 years. Please note, one can completely turn off test execution when building with Maven like this |
As it turned out, enforcing progressive rendering, which is the actual workaround to avoid the pinked-up images, has a severe impact on performance. Test cases take twice as long to finish, which is not acceptable where it is unnecessary. I'll try to get more insight from the image data so that this is triggered only when it is likely required. In the long term there could be a config flag which controls this. |
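Which image properties actually trigger the broken colours is not established in this thread. As a sketch of how image data could be inspected cheaply before choosing a rendering path, the following Python function reads the frame type and component count from the first start-of-frame segment of a JPEG stream. The function name is made up, and the parser ignores rarely used markers that carry no length field.

```python
import struct

# Start-of-frame markers as defined by the JPEG standard (ITU-T T.81).
SOF_MARKERS = {0xC0: "baseline", 0xC1: "extended sequential", 0xC2: "progressive"}


def jpeg_frame_info(data):
    """Return (frame type, component count) of the first SOF segment, or None."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return None  # lost marker sync, give up
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker
            pos += 1
            continue
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker in SOF_MARKERS:
            if pos + 10 > len(data):
                return None  # truncated segment
            # After the length: precision (1), height (2), width (2), components (1).
            return SOF_MARKERS[marker], data[pos + 9]
        if marker == 0xDA:  # start of scan reached without a SOF
            return None
        pos += 2 + length
    return None
```

A component count of four typically indicates CMYK or YCCK data; a check like this could decide per image whether a slower code path is worth taking.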
Couldn't we have some initial conversion step (only on the input side, before any image derivatives are generated) to get rid of these formats?
Yes! That is exactly the way I've got in mind! |
I have a METS where all FLocats are LOCTYPE=URL (as required by DFGViewer), but local directories FULLTEXT and MAX do exist as well.
Unfortunately, digital-derivans does not seem to like this representation:
So do I have to convert the hrefs to local path?
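For readers with the same setup, converting the hrefs can be scripted. The Python sketch below rewrites URL FLocats in selected file groups to relative paths of the form `<USE>/<file name>`, using the `LOCTYPE="OTHER" OTHERLOCTYPE="FILE"` form shown elsewhere in this thread. It assumes the local file name equals the last path segment of the URL, which may not hold for every repository. `localize_flocats` is a hypothetical helper, not part of digital-derivans or OCR-D.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath
from urllib.parse import urlparse

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"


def localize_flocats(root, groups=("MAX", "FULLTEXT")):
    """Rewrite URL FLocats of the given fileGrps to '<USE>/<name>'; return the count."""
    changed = 0
    for grp in root.iter(f"{{{METS}}}fileGrp"):
        use = grp.get("USE")
        if use not in groups:
            continue
        for flocat in grp.iter(f"{{{METS}}}FLocat"):
            if flocat.get("LOCTYPE") != "URL":
                continue
            href = flocat.get(f"{{{XLINK}}}href", "")
            # Assumption: the last URL path segment is the local file name.
            name = PurePosixPath(urlparse(href).path).name
            if not name:
                continue
            flocat.set("LOCTYPE", "OTHER")
            flocat.set("OTHERLOCTYPE", "FILE")
            flocat.set(f"{{{XLINK}}}href", f"{use}/{name}")
            changed += 1
    return changed
```

Other file groups are left untouched, so a copy of the METS processed this way keeps its remaining URL references.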