Digital Derivans

Java command line tool that creates image derivates in different, configurable sizes and qualities, optionally appends an image footer, and can assemble image files and OCR data into searchable PDF files with a hidden text layer and an outline. Derivates that can be created independently of each other, such as scaled variants of single images, are processed in parallel.

Uses mets-model for METS/MODS-handling, classical iText5 to create PDF, Apache log4j2 for logging and a workflow inspired by OCR-D/Core Workflows.

Features
Local Installation
Usage
Configuration
Limitations
License

Features

Create JPEG or PDF from TIFF or JPEG with optional Footer appended and custom constraints on compression rate and max sizes.
For details see configuration section.

If METS/MODS-information is available, the following will be taken into account:

Attribute mets:div[@ORDER] for file containers as defined in the METS physical structMap to create a PDF outline
Attribute mets:div[@CONTENTIDS] (granular URN) will be rendered for each page if footer shall be appended to each page image

Docker Image

An official Docker image is available.

Pull the image:

docker pull ghcr.io/ulb-sachsen-anhalt/digital-derivans:latest

or build it yourself locally:

./scripts/build_docker_image.sh

Using the Docker image works as described in the Usage section, except that you have to pass all required directories and files as mapped volumes.

For example:

docker run \
  --mount type=bind,source=<host-work-dir>,target=/data-print \
  --mount type=bind,source=<host-config-dir>,target=/data-config \
  --mount type=bind,source=<host-log-dir>,target=/data-log \
  ghcr.io/ulb-sachsen-anhalt/digital-derivans \
  <print-dir|mets-file> -c /data-config/derivans.ini

Please note:
The logging configuration file must be located inside the config directory.

Installation

Digital Derivans is a Java 11+ project built with Apache Maven.

Development Requirements

OpenJDK 11+
Maven 3.6+
git 2.12+

Pull and compile

Clone the repository and call Maven to trigger the build process. Be aware that a recent OpenJDK is required.

git clone git@github.com:ulb-sachsen-anhalt/digital-derivans.git
cd digital-derivans
mvn clean package

This first runs the tests and afterwards creates a shaded JAR ("FAT-JAR") inside the build directory (./target/digital-derivans-<version>.jar).

Usage

In local mode, a recent OpenJRE is required.

The tool expects a project folder containing an image directory (default: MAX) and an optional OCR data directory (default: FULLTEXT).

The default name of the generated PDF is derived from the project folder name.

A sample folder structure:

my_print/
├── FULLTEXT
│   ├── 0002.xml
│   ├── 0021.xml
│   ├── 0332.xml
├── MAX
│   ├── 0002.tif
│   ├── 0021.tif
│   ├── 0332.tif

Running

java -jar ./target/digital-derivans-<version>.jar <path-to-my_print>

will produce a file named my_print.pdf in the my_print directory from above with specified layout.
For more information concerning CLI-Usage, please consult CLI docs.

Configuration

Although Derivans can be run without configuration, providing one is strongly recommended. Many flags, especially those that apply when metadata must be taken into account, use defaults tied to the digitization workflows of ULB Sachsen-Anhalt, which might not fit your requirements.

Configure Sections

Configuration options can be bundled into sections and customized with an INI file.

Some parameters can be set on global level, like quality and poolsize.
Each section in a *.ini file matching [derivate_<n>] represents a single derivate section for intermediate or final derivates.

The order of execution is determined by the pairs of input and output paths, whereas the numbering of the derivate sections determines the order at parse time.

Default Values

The top of the INI file lists configuration values that are used as defaults for the actual steps, where they can be applied.

default_quality : image data compression rate (can be specified with quality for image derivate sections)
default_poolsize : poolsize of worker threads for parallel processing (can be specified with poolsize for image derivate sections)
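
The role of a worker pool size can be illustrated with a small Java sketch. It is not taken from the Derivans sources: the class PoolSketch and the method runParallel are hypothetical names, and the sketch only assumes that independent per-image steps are handed to a fixed-size thread pool whose size corresponds to default_poolsize.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical illustration, not part of Derivans. */
final class PoolSketch {

    /**
     * Applies {@code step} to every input using at most {@code poolsize}
     * worker threads and returns the results in input order.
     */
    static <I, O> List<O> runParallel(List<I> inputs, Function<I, O> step, int poolsize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolsize);
        try {
            List<Callable<O>> tasks = new ArrayList<>();
            for (I input : inputs) {
                tasks.add(() -> step.apply(input));
            }
            List<O> results = new ArrayList<>();
            // invokeAll blocks until all tasks are done and keeps the task order
            for (Future<O> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a worker failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With a pool size of 4, up to four images would be handled at the same time in this sketch, which matches the recommendation to use 4 only if at least 4 CPUs are available.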

Section-specific Configuration

Some option values must be set individually for each step:

input_dir : path to directory with section input files
output_dir: path to directory for section output

Additional options can be set, depending on the type of derivate to create:

Images:

quality : compression rate
poolsize : parallel workers
maximal : maximal dimension (affects both width and height)
footer_template : footer template Path
footer_label_copyright : additional (static) label for footer
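
One way to read the maximal option is "scale so that the longer side does not exceed this value". The following Java sketch shows that reading. It is an assumption for illustration only: the class MaximalScaleSketch and the method fit are hypothetical, and whether Derivans keeps the aspect ratio or leaves smaller images untouched is not confirmed by this document.

```java
import java.awt.Dimension;

/** Hypothetical helper, not part of Derivans. */
final class MaximalScaleSketch {

    /**
     * Returns target dimensions whose longer side is at most {@code maximal}
     * while keeping the aspect ratio. Images that already fit are returned
     * unchanged (an assumption made for this sketch).
     */
    static Dimension fit(int width, int height, int maximal) {
        if (width <= 0 || height <= 0 || maximal <= 0) {
            throw new IllegalArgumentException("width, height and maximal must be positive");
        }
        int longer = Math.max(width, height);
        if (longer <= maximal) {
            return new Dimension(width, height);
        }
        double ratio = (double) maximal / longer;
        // never round a side down to zero pixels
        int w = Math.max(1, (int) Math.round(width * ratio));
        int h = Math.max(1, (int) Math.round(height * ratio));
        return new Dimension(w, h);
    }
}
```

Under these assumptions, a 4000 x 2000 image with maximal set to 1000 would be scaled to 1000 x 500.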

PDF:

metadata_creator : enrich creator tag
metadata_keywords: enrich keywords
enrich_pdf_metadata : whether information about the generated PDF shall be enriched into the METS/MODS (default: True)
mods_identifier_xpath : if not set, use mods:recordIdentifier from primary MODS
mets_filegroup_fulltext: METS-filegroup for OCR-Data (default: FULLTEXT)
mets_filegroup_images : METS-filegroup for image data (default: MAX)

Minimal working Example

The following example configuration contains global settings and subsequent generation steps.
(The directory and file layout from the Usage section is assumed.)

On global level, it sets the default JPEG-quality to 75, the number of parallel executors to 4 (recommended if at least 4 CPUs available) and determines the file for the logging-configuration.

Create JPEG images from images in sub directory MAX with compression rate 75, scale to maximal dimension 1000px and store in sub dir IMAGE_75.
Create PDF with images from IMAGE_75, add some PDF metadata and store file as my_print.pdf in current dir.

default_quality = 75
default_poolsize = 4
logger_configuration_file = derivans_logging.xml

[derivate_01]
input_dir = MAX
output_dir = IMAGE_75
maximal = 1000

[derivate_02]
type = pdf
input_dir = IMAGE_75
output_dir = .
output_type = pdf
metadata_creator = "<your organization label>"
metadata_license = "Public Domain Mark 1.0"

CLI Parameter

The main parameter for Derivans is the input path, which may be a local directory in local mode or the path to a local METS/MODS-file with sub directories for images and OCR-data, if using metadata.

Additionally, the following can be provided via CLI:

path to a custom configuration INI file
labels for OCR data and input images (these overwrite the configuration)
If metadata is present, both labels are used as filegroup names; for images, the label is also used as the input directory for initial image processing.

Limitations

Derivans depends on standard JDK11-components and external components for image processing and PDF generation.

Step Configuration

Subsequent derivate steps must not have order gaps, since the parsing is done step by step. Otherwise, any derivate section after the first gap will be ignored, which may lead to unexpected results.
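
The effect of a numbering gap can be sketched in Java as follows. This is not the actual Derivans parser: the class DerivateSectionSketch and the method consecutiveSections are hypothetical, and the two-digit section names (derivate_01, derivate_02, ...) are assumed from the example configuration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical illustration of step-by-step section parsing, not part of Derivans. */
final class DerivateSectionSketch {

    /**
     * Collects derivate_01, derivate_02, ... in order and stops at the first
     * missing number, so every section after a gap is never seen.
     */
    static List<String> consecutiveSections(Set<String> sectionNames) {
        List<String> found = new ArrayList<>();
        for (int n = 1; ; n++) {
            String name = String.format(Locale.ROOT, "derivate_%02d", n);
            if (!sectionNames.contains(name)) {
                break; // first gap ends the parsing
            }
            found.add(name);
        }
        return found;
    }
}
```

In this sketch, a file with the sections derivate_01, derivate_02 and derivate_04 yields only the first two steps; derivate_04 is silently dropped.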

Image Processing

Please note:
To overcome javax.imageio errors, it is recommended to repair the affected images with an external image processing application.

Images with more than 8 bit channel depth can't be processed: javax.imageio.IIOException: Illegal band size
Uncommon image metadata can't be processed: javax.imageio.IIOException: Unsupported marker
Integral dimension values are required for proper scaling: javax.imageio.metadata.IIOInvalidTreeException: Xdensity attribute out of range

PDF Generation

If Derivans is called from within the project folder, the resulting PDF will be named ..pdf instead of being named after the project folder.
The iText PDF library limits the maximal page dimension to 14400 px (width/height), so a configured maximal dimension fails for very large images. This may cause trouble when generating PDFs for very large prints like maps, deeds or scrolls.
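
A pre-flight check against the page limit could look like the following Java sketch. It does not call iText and is not part of Derivans: the class PdfPageLimitSketch and its methods are hypothetical, and the constant only restates the 14400 limit named above.

```java
/** Hypothetical pre-flight check, not part of Derivans and independent of iText. */
final class PdfPageLimitSketch {

    /** Maximal page dimension as stated in this document. */
    static final int MAX_PAGE_DIMENSION = 14400;

    /** True if both sides are within the page limit. */
    static boolean fitsOnPage(int width, int height) {
        return width <= MAX_PAGE_DIMENSION && height <= MAX_PAGE_DIMENSION;
    }

    /**
     * Factor by which an image would have to be scaled down so that its
     * longer side meets the limit; 1.0 if it already fits.
     */
    static double downscaleFactor(int width, int height) {
        int longer = Math.max(width, height);
        return longer <= MAX_PAGE_DIMENSION ? 1.0 : (double) MAX_PAGE_DIMENSION / longer;
    }
}
```

For a scan of 28800 x 1000 px, for example, the sketch reports that the page does not fit and suggests a factor of 0.5, which could be used to choose a suitable maximal value for the preceding image step.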

Metadata

Derivans does not accept METS with current OCR-D-style nor any other METS which contains extended XML-features like inline namespace declarations.

License

This project's source code is licensed under terms of the MIT license.

NOTE: This project depends on components that may use different license terms.
