configured input_dir is ignored #41

Open
bertsky opened this issue Apr 20, 2023 · 1 comment

bertsky commented Apr 20, 2023

If you configure anything other than MAX / FULLTEXT as a directory in a config file, this produces inconsistent (dysfunctional) file paths, e.g.

stacktrace

2023-04-20 12:23:13 [ERROR] (App:45) java.io.FileNotFoundException: /home/kmw/nfs/schütz-test/DEFAULT/00000001.tif.large.jpg (Datei oder Verzeichnis nicht gefunden)
de.ulb.digital.derivans.DigitalDerivansException: java.io.FileNotFoundException: /home/kmw/nfs/schütz-test/DEFAULT/00000001.tif.large.jpg (Datei oder Verzeichnis nicht gefunden)
	at de.ulb.digital.derivans.derivate.PDFDerivateer.create(PDFDerivateer.java:401)
	at de.ulb.digital.derivans.Derivans.create(Derivans.java:181)
	at de.ulb.digital.derivans.App.main(App.java:43)
Caused by: java.io.FileNotFoundException: /home/kmw/nfs/schütz-test/DEFAULT/00000001.tif.large.jpg (Datei oder Verzeichnis nicht gefunden)
	at java.base/java.io.FileInputStream.open0(Native Method)
	at java.base/java.io.FileInputStream.open(FileInputStream.java:219)
	at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
	at java.base/java.io.FileInputStream.<init>(FileInputStream.java:112)
	at java.base/sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:86)
	at java.base/sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:184)
	at java.base/java.net.URL.openStream(URL.java:1140)
	at com.itextpdf.text.Image.getInstance(Image.java:260)
	at com.itextpdf.text.Image.getInstance(Image.java:241)
	at com.itextpdf.text.Image.getInstance(Image.java:364)
	at de.ulb.digital.derivans.derivate.PDFDerivateer.create(PDFDerivateer.java:390)

(Notice how the path combines the non-default configured input_dir DEFAULT as the directory name with a file name taken from the default MAX file group.)

The problem is here:

    DigitalPage page = new DigitalPage(n);
    LOGGER.debug("create digital page from {}", fptrs);
    // handle image file
    Optional<FilePointerMatch> optMaxImage = fptrs.stream()
            .filter(fptr -> FILEGROUP_MAX.equals(fptr.getFileGroup())).findFirst();
    if (optMaxImage.isPresent()) {
        FilePointerMatch match = optMaxImage.get();
        enrichImageData(physSubDiv, page, match);
    }
    // handle optional attached ocr file
    LOGGER.trace("search for {} within {}", FILEGROUP_FULLTEXT, fptrs);
    Optional<FilePointerMatch> optFulltext = fptrs.stream()
            .filter(fptr -> FILEGROUP_FULLTEXT.equals(fptr.getFileGroup())).findFirst();
    if (optFulltext.isPresent()) {
        FilePointerMatch match = optFulltext.get();
        enrichFulltextData(physSubDiv, page, match);
    }

This sets up all DigitalPage instances with paths from MAX and FULLTEXT, irrespective of the configuration.
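
One possible direction, as a minimal sketch only: pass the configured file group into the lookup instead of comparing against a fixed constant. The `FilePointerMatch` class below is a hypothetical stand-in (the real class's fields are not shown in this issue), `findByGroup` is an invented helper name, and the second file name is made up for illustration.

```java
import java.util.List;
import java.util.Optional;

public class FileGroupSelection {

    /** Hypothetical stand-in for the project's FilePointerMatch. */
    static final class FilePointerMatch {
        private final String fileGroup;
        private final String fileName;

        FilePointerMatch(String fileGroup, String fileName) {
            this.fileGroup = fileGroup;
            this.fileName = fileName;
        }

        String getFileGroup() { return fileGroup; }
        String getFileName() { return fileName; }
    }

    /** Picks the first pointer of the configured group rather than a hard-coded one. */
    static Optional<FilePointerMatch> findByGroup(List<FilePointerMatch> fptrs, String configuredGroup) {
        return fptrs.stream()
                .filter(fptr -> configuredGroup.equals(fptr.getFileGroup()))
                .findFirst();
    }

    public static void main(String[] args) {
        // The DEFAULT entry's file name is illustrative, not taken from real data.
        List<FilePointerMatch> fptrs = List.of(
                new FilePointerMatch("MAX", "00000001.tif.large.jpg"),
                new FilePointerMatch("DEFAULT", "00000001.jpg"));
        findByGroup(fptrs, "DEFAULT")
                .ifPresent(m -> System.out.println(m.getFileGroup() + "/" + m.getFileName()));
    }
}
```

With a lookup like this, the directory name and the file name would come from the same file group, so they could not diverge the way they do in the stack trace.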

This is later combined with the first input step's input_dir:

pages = store.getDigitalPagesInOrder();
resolver.enrichAbsoluteStartPath(pages, step0.getInputPath());
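
To make the mismatch visible before the PDF step fails, a check along the following lines might help. This is a sketch under assumptions, not the project's code: `StartPathCheck`, `resolve` and `findMissing` are hypothetical names, and it only assumes that the start path is the input directory joined with the page's file name.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

public class StartPathCheck {

    /** Joins the first step's input directory with a page's file name. */
    static Path resolve(Path inputDir, String fileName) {
        return inputDir.resolve(fileName);
    }

    /** Returns those resolved paths that do not exist on disk. */
    static List<Path> findMissing(Path inputDir, List<String> fileNames) {
        return fileNames.stream()
                .map(name -> resolve(inputDir, name))
                .filter(path -> !Files.exists(path))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Directory from the configuration, file name from the MAX file group,
        // i.e. the same combination as in the stack trace above.
        Path inputDir = Path.of("DEFAULT");
        List<Path> missing = findMissing(inputDir, List.of("00000001.tif.large.jpg"));
        missing.forEach(p -> System.out.println("missing input file: " + p));
    }
}
```

Reporting all missing files right after the paths are enriched would point at the configuration problem directly, instead of surfacing as a FileNotFoundException deep inside the PDF creation.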

Member

M3ssman commented Apr 27, 2023

We should list all components and parameters that require or benefit from configuration, and remove as many hard-coded assumptions as possible:

  • Metadatastore - calculation of PDF label/identifier, directories for fulltext and input start, filename extensions
  • Pathresolver - filename extensions, directories for fulltext and input start
  • ImageProcessor - quality, maximum size, input/output directories, enforce progressive rendering (cf. NPE on getImagePath if LOCTYPE=URL #39)
  • PDFDerivateer - render level, font family(?)
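
As a rough illustration of replacing hard-coded assumptions with configurable defaults, the file group names could be read from settings and fall back to the current values. Everything here is an assumption for the sake of the sketch: the class name `FileGroupSettings` and the keys `image_group` and `fulltext_group` are hypothetical and are not existing Derivans options.

```java
import java.util.Properties;

public class FileGroupSettings {

    // Current hard-coded values, kept as fallbacks.
    static final String DEFAULT_IMAGE_GROUP = "MAX";
    static final String DEFAULT_FULLTEXT_GROUP = "FULLTEXT";

    final String imageGroup;
    final String fulltextGroup;

    FileGroupSettings(Properties props) {
        // "image_group" and "fulltext_group" are hypothetical keys.
        this.imageGroup = props.getProperty("image_group", DEFAULT_IMAGE_GROUP);
        this.fulltextGroup = props.getProperty("fulltext_group", DEFAULT_FULLTEXT_GROUP);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("image_group", "DEFAULT");
        FileGroupSettings settings = new FileGroupSettings(props);
        System.out.println(settings.imageGroup + " / " + settings.fulltextGroup);
    }
}
```

A single settings object of this kind could then be handed to each of the components listed above, so that they all agree on the same directories and file groups.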
