Java command line tool that creates image derivates with configurable sizes and qualities, optionally appends a footer to each image, and can assemble image files and OCR data into searchable PDF files with a hidden text layer and an outline. Derivates that can be created independently, such as individual scaled image variants, are processed in parallel.
Uses mets-model for METS/MODS handling, classic iText 5 to create PDFs, Apache Log4j 2 for logging, and a workflow inspired by OCR-D/Core workflows.
Creates JPEG or PDF from TIFF or JPEG input, with an optional footer and custom constraints on compression rate and maximal size.
For details, see the configuration section.
If METS/MODS information is available, the following will be taken into account:
- Attribute mets:div[@ORDER] for file containers, as defined in the METS physical structMap, to create a PDF outline
- Attribute mets:div[@CONTENTIDS] (granular URN), which is rendered on each page if a footer is appended to each page image
An official Docker image is available.
Pull the image:
docker pull ghcr.io/ulb-sachsen-anhalt/digital-derivans:latest
or build it locally yourself:
./scripts/build_docker_image.sh
Using the Docker image works as described in the Usage section, except that all required directories and files must be passed as mapped volumes.
For example:
docker run \
--mount type=bind,source=<host-work-dir>,target=/data-print \
--mount type=bind,source=<host-config-dir>,target=/data-config \
--mount type=bind,source=<host-log-dir>,target=/data-log \
ghcr.io/ulb-sachsen-anhalt/digital-derivans \
<print-dir|mets-file> -c /data-config/derivans.ini
Please note:
The logging configuration must be located inside the config directory.
Digital Derivans is a Java 11+ project built with Apache Maven.
- OpenJDK 11+
- Maven 3.6+
- git 2.12+
Clone the repository and call Maven to trigger the build process. Be aware that a recent OpenJDK is required.
git clone git@github.com:ulb-sachsen-anhalt/digital-derivans.git
cd digital-derivans
mvn clean package
This will first run the tests and afterwards create a shaded JAR ("FAT-JAR") inside the build directory (./target/digital-derivans-<version>.jar).
In local mode, a recent OpenJRE is required.
The tool expects a project folder containing an image directory (default: MAX) and an optional OCR data directory (default: FULLTEXT).
The default name of the generated PDF is derived from the project folder name.
A sample folder structure:
my_print/
├── FULLTEXT
│ ├── 0002.xml
│ ├── 0021.xml
│ ├── 0332.xml
├── MAX
│ ├── 0002.tif
│ ├── 0021.tif
│ ├── 0332.tif
Running
java -jar <PATH>./target/digital-derivans-<version>.jar <path-to-my_print>
will produce a file named my_print.pdf in the my_print directory with the layout shown above.
For more information on CLI usage, please consult the CLI docs.
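To check a project folder before running the tool, the sample layout above can be inspected with a short script. The following is a minimal sketch, not part of Derivans: the function name is hypothetical, and matching OCR files to images by file stem (0002.tif to 0002.xml) is an assumption read off the sample structure.

```python
from pathlib import Path
from typing import Dict, Optional


def pair_images_with_ocr(
    project_dir: Path, image_dir: str = "MAX", ocr_dir: str = "FULLTEXT"
) -> Dict[str, Optional[Path]]:
    """Map each TIFF file stem to its OCR file, or None if no OCR file exists.

    Assumption: OCR files share the stem of their image (0002.tif -> 0002.xml).
    """
    ocr_root = project_dir / ocr_dir
    pairs: Dict[str, Optional[Path]] = {}
    for image in sorted((project_dir / image_dir).glob("*.tif")):
        candidate = ocr_root / (image.stem + ".xml")
        # The OCR directory is optional, so a missing file is not an error.
        pairs[image.stem] = candidate if candidate.is_file() else None
    return pairs
```

Calling pair_images_with_ocr(Path("my_print")) lists every page and shows which ones lack OCR data.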
Although Derivans can be run without configuration, providing one is strongly recommended. Many options, especially those that apply when metadata must be taken into account, use defaults tied to the digitization workflows of ULB Sachsen-Anhalt that might not fit your requirements.
Configuration options can be bundled into sections and customized with an INI file.
Some parameters, such as quality and pool size, can be set at the global level.
Each section in a *.ini file matching [derivate_<n>] represents a single derivate section for intermediate or final derivates.
The order of execution is determined by pairs of input and output paths, whereas the numbering of derivate sections determines the order at parse time.
Configuration values listed at the top of the INI file are used as defaults for the actual steps, where applicable.
- default_quality: image data compression rate (can be specified with quality for image derivate sections)
- default_poolsize: pool size of worker threads for parallel processing (can be specified with poolsize for image derivate sections)
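The fallback from a section value to the global default can be illustrated with a short sketch. This is not Derivans' own parsing code: the function names and the IMAGE_50 directory are hypothetical, and because Python's configparser rejects keys without a section header, the global keys are placed into an artificial [global] section.

```python
import configparser

EXAMPLE = """
default_quality = 75
default_poolsize = 4

[derivate_01]
input_dir = MAX
output_dir = IMAGE_75

[derivate_02]
input_dir = MAX
output_dir = IMAGE_50
quality = 50
"""


def load(text: str) -> configparser.ConfigParser:
    parser = configparser.ConfigParser()
    # Hypothetical [global] header: only needed to satisfy configparser.
    parser.read_string("[global]\n" + text)
    return parser


def effective(parser: configparser.ConfigParser, section: str, key: str) -> int:
    """Return the section's own value for key, else the global default_<key>."""
    if parser.has_option(section, key):
        return parser.getint(section, key)
    return parser.getint("global", "default_" + key)
```

Here effective(load(EXAMPLE), "derivate_01", "quality") falls back to the global value, while derivate_02 uses its own quality setting.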
Some option values must be set individually for each step:
- input_dir: path to the directory with the section's input files
- output_dir: path to the directory for the section's output
Additional options can be set, depending on the type of derivate:
Images:
- quality: compression rate
- poolsize: number of parallel workers
- maximal: maximal dimension (affects both width and height)
- footer_template: path to the footer template
- footer_label_copyright: additional (static) label for the footer
PDF:
- metadata_creator: enrich creator tag
- metadata_keywords: enrich keywords
- enrich_pdf_metadata: whether the PDF shall be enriched into METS/MODS (default: True)
- mods_identifier_xpath: if not set, use mods:recordIdentifier from primary MODS
- mets_filegroup_fulltext: METS filegroup for OCR data (default: FULLTEXT)
- mets_filegroup_images: METS filegroup for image data (default: MAX)
The following example configuration contains global settings and subsequent generation steps.
(The example directory and file layout from the Usage section is assumed.)
At the global level, it sets the default JPEG quality to 75 and the number of parallel executors to 4 (recommended if at least 4 CPUs are available), and specifies the file for the logging configuration.
- Create JPEG images from the images in sub directory MAX with compression rate 75, scale them to a maximal dimension of 1000px and store them in sub directory IMAGE_75.
- Create a PDF with the images from IMAGE_75, add some PDF metadata and store the file as my_print.pdf in the current directory.
default_quality = 75
default_poolsize = 4
logger_configuration_file = derivans_logging.xml
[derivate_01]
input_dir = MAX
output_dir = IMAGE_75
maximal = 1000
[derivate_02]
type = pdf
input_dir = IMAGE_75
output_dir = .
output_type = pdf
metadata_creator = "<your organization label>"
metadata_license = "Public Domain Mark 1.0"
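How the two steps above connect through their input and output paths can be made visible with a small sketch. It is an illustration only and does not reproduce Derivans' internal logic: describe_chain is a hypothetical helper, and the global keys are omitted here because Python's configparser requires a section header.

```python
import configparser
from typing import List, Optional, Tuple

CONFIG = """
[derivate_01]
input_dir = MAX
output_dir = IMAGE_75
maximal = 1000

[derivate_02]
type = pdf
input_dir = IMAGE_75
output_dir = .
"""


def describe_chain(text: str) -> List[Tuple[str, str, Optional[str]]]:
    """For each section, report its input_dir and the section producing it."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # Map every output directory to the section that writes into it.
    producers = {parser[name]["output_dir"]: name for name in parser.sections()}
    chain = []
    for name in parser.sections():
        source = parser[name]["input_dir"]
        # None means the input is not produced by any step (original data).
        chain.append((name, source, producers.get(source)))
    return chain
```

For the configuration above, the result shows that derivate_01 reads original data from MAX and that derivate_02 consumes the output of derivate_01.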
The main parameter for Derivans is the input path: either a local directory (local mode) or, if metadata is used, the path to a local METS/MODS file with sub directories for images and OCR data.
Additionally, the following can be provided via CLI:
- the path to a custom configuration INI file
- labels for OCR and input images (these override the configuration)
If metadata is present, both labels are used as filegroup names; the image label is also used as the input directory for initial image processing.
Derivans depends on standard JDK11-components and external components for image processing and PDF generation.
- Subsequent derivate steps must not have order gaps, since the parsing is done step by step. Otherwise, any derivate section after the first gap will be ignored, which may lead to unexpected results.
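The effect of a numbering gap can be sketched as follows. This only mirrors the behavior described above and is not taken from the Derivans source: the function name is hypothetical, and starting the numbering at 1 is an assumption based on the example configuration.

```python
import re
from typing import Iterable, List


def sections_until_gap(section_names: Iterable[str]) -> List[str]:
    """Return the derivate sections a step-by-step parser would accept."""
    numbered = {}
    for name in section_names:
        match = re.fullmatch(r"derivate_(\d+)", name)
        if match:
            numbered[int(match.group(1))] = name
    accepted = []
    step = 1  # assumption: numbering starts at 1, as in derivate_01
    while step in numbered:
        accepted.append(numbered[step])
        step += 1
    # Everything after the first missing number is never reached.
    return accepted
```

With the sections derivate_01, derivate_02 and derivate_04, only the first two are returned, since derivate_03 is missing.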
Please note:
To overcome javax.imageio errors, it's recommended to fix the affected images using an external image processing application.
- Images with more than 8bit channel depth can't be processed: javax.imageio.IIOException: Illegal band size
- Uncommon image metadata can't be processed: javax.imageio.IIOException: Unsupported marker
- Integral dimension values are required for proper scaling: javax.imageio.metadata.IIOInvalidTreeException: Xdensity attribute out of range
- If Derivans is called from within the project folder, the resulting PDF will be called ..pdf.
- The iText PDF library limits the maximal page dimension to 14400 px (width/height; a configured maximal dimension fails for very large images). This may cause trouble if one needs to generate PDF for very large prints like maps, deeds or scrolls.
- Derivans does not accept METS with current OCR-D-style nor any other METS which contains extended XML-features like inline namespace declarations.
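For very large prints, it may help to check image dimensions against the page limit noted above before generating a PDF. The sketch below is an assumption-laden illustration: both functions are hypothetical, the limit is simply the 14400 px stated above, and whether pre-scaling is an acceptable workaround depends on your material.

```python
from typing import Tuple

MAX_PAGE_DIMENSION = 14400  # limit as stated in the list above


def exceeds_page_limit(width: int, height: int, limit: int = MAX_PAGE_DIMENSION) -> bool:
    """True if either side is larger than the page limit."""
    return max(width, height) > limit


def fit_within_limit(width: int, height: int, limit: int = MAX_PAGE_DIMENSION) -> Tuple[int, int]:
    """Scale dimensions down proportionally so the longer side fits the limit."""
    longest = max(width, height)
    if longest <= limit:
        return width, height
    scale = limit / longest
    # Truncate to whole pixels, but never below one pixel.
    return max(1, int(width * scale)), max(1, int(height * scale))
```

A scan of 28800 x 7200 px exceeds the limit and would need to be halved to fit.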
This project's source code is licensed under the terms of the MIT license.
NOTE: This project depends on components that may use different license terms.