This project provides a script for performing OCR quality assessment using Bloom filters. It processes input text files, computes various OCR quality metrics, and outputs the results.
The build process has been tested on modern Linux and macOS systems. Before cloning the repository, ensure that the following dependencies are installed:
# Ubuntu or Debian
sudo apt update -y && sudo apt install -y git git-lfs make
# macOS, assuming a working Homebrew installation
brew upgrade && brew install git git-lfs make
- Clone the repository:
git clone --recursive https://github.com/your-repo/impresso-ocr-qa-unigram.git
cd impresso-ocr-qa-unigram
- Configure the installation:
cp cookbook/dotenv.sample .env # edit .env with the s3 credentials
- Install the required dependencies:
make setup
To run the OCR quality assessment script, use the following command:
python lib/ocrqa_bloom.py --input input.jsonl --bloomdicts bloom1.bloom bloom2.bloom --languages en fr --methods slc unk_ratio --output results.jsonl --lid langident.json
--log-file FILE
: Write log to FILE.

-q, --quiet
: Do not print status messages to stderr (default: False).

-v, --verbose-output
: Print verbose output information (default: False).

-C, --single_letter_cost
: Cost for an infrequent single char (default: 0.7).

-S, --single_symbol_cost
: Cost for an infrequent symbol char (default: 0.3).

-l, --languages
: Language iso-2-letter codes (must match the sequence of provided bloom dictionaries).

--input
: Input JSONL files (default: stdin).

--bloomdicts
: Paths to JSON files containing bloom dictionaries keys or Hugging Face Hub references. Must match the sequence of provided languages.

--unicode-normalization
: Unicode normalization form to apply to input text (default: NFKC).

--log-level
: Logging level (default: INFO).

--methods
: OCR QA methods to use (default: unk_type_ratio).

--keep-best
: Keep only the highest OCR value for a given content item using the first method in --methods (default: False).

--output
: Output file (default: stdout).

--lid
: Path to language identification file.

--s3-output-path
: S3 path to upload the output file after processing or check if it already exists.

--quit-if-s3-output-exists
: Quit if the output file already exists in the specified S3 bucket.

--keep-timestamp-only
: After uploading to S3, keep only the timestamp of the local output file for data efficiency.

--s3-output-dry-run
: Dry run which suppresses all write operations to S3 and checks whether output files on S3 exist.
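The --unicode-normalization option defaults to NFKC. As a small illustration of what that form does to characters that often appear in historical prints, the following sketch uses Python's standard unicodedata module. The sample string is made up for this example, and how the script applies normalization internally is not shown here.

```python
import unicodedata

# Made-up sample: "ﬁ" is the ligature U+FB01, "ſ" is the long s U+017F.
raw = "ﬁne Preſs"

# NFKC replaces compatibility characters with their plain equivalents,
# so dictionary lookups see ordinary letters.
normalized = unicodedata.normalize("NFKC", raw)

# NFC only composes canonical sequences and leaves these characters alone.
canonical_only = unicodedata.normalize("NFC", raw)

print(normalized)
print(canonical_only == raw)
```

With NFKC, a word printed with a ligature or a long s can match the same dictionary entry as its modern spelling, which is presumably why it is the default.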
The default method used for OCR quality assessment is unk_type_ratio. This method calculates the ratio of known unique subtoken types to all unique subtoken types. It provides a measure of how many unique words in the text are recognized by the Bloom filter, which can be an indicator of OCR quality.
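The ratio described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in lib/ocrqa_bloom.py: the TinyBloom class, the known_type_ratio function, and the regex-based tokenization are all hypothetical stand-ins for the project's Bloom dictionaries and subtokenizer.

```python
import hashlib
import re
import unicodedata


class TinyBloom:
    """Toy Bloom filter for illustration only (hypothetical, not the project's)."""

    def __init__(self, size_bits=8192, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive one bit position per hash from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def known_type_ratio(text, bloom):
    """Share of unique token types that the Bloom filter recognizes."""
    # Assumed tokenization: NFKC-normalize, lowercase, split on word characters.
    normalized = unicodedata.normalize("NFKC", text).lower()
    types = set(re.findall(r"\w+", normalized))
    if not types:
        return 0.0
    known = sum(1 for token in types if token in bloom)
    return known / len(types)


bloom = TinyBloom()
for word in ["the", "quick", "brown", "fox"]:
    bloom.add(word)

clean_ratio = known_type_ratio("The quick brown fox", bloom)
noisy_ratio = known_type_ratio("The qu1ck brovvn fox", bloom)
print(clean_ratio, noisy_ratio)
```

Because the ratio is computed over unique types rather than all tokens, a frequent word counts only once, so a few very common correct words cannot mask many distinct OCR errors.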
To use bloom dictionaries hosted on the Hugging Face Hub, pass hf:// references to --bloomdicts:
python lib/ocrqa_bloom.py --input input.jsonl --bloomdicts hf://model_id/bloom1.bloom hf://model_id/bloom2.bloom --languages en fr --methods slc unk_ratio --output results.jsonl --lid langident.json
Historically, Luxembourgish spelling used apostrophes after vowels for several purposes:
- Indicating long or stressed vowels
  - gro’ss → modern grouss
  - se’er → modern seier
- Marking elision or glottalization
  - ge’nt, go’f, go’w (possible sound loss or separation)
- Clarifying pronunciation in loanwords
  - Unio’n, situatio’n, millio’nen
- Separating prefixes or morphemes
  - ne’deg → modern néideg
  - we’neg → modern wéineg
- Pre-1946: Apostrophes were common after vowels, often inconsistently.
- 1946 Reform: Reduced apostrophe use, favoring phonetic spelling.
- 1975 Reform: Further simplification, removing unnecessary markers.
- 1999 Reform: Apostrophes after vowels were eliminated, except in contractions (e.g., d’Kanner remains, but se’er → seier).
The historical use of apostrophes after vowels served as a pronunciation guide for vowel length, stress, and borrowed words. Over time, Luxembourgish orthography standardized and simplified, leading to the apostrophe's removal in these contexts.
For any questions or issues, please contact [email protected].
Impresso - Media Monitoring of the Past is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. CRSII5_173719, and the second project (2023-2027) by the SNSF under grant No. CRSII5_213585 and the Luxembourg National Research Fund under grant No. 17498891.
Copyright (C) 2018-2024 The Impresso team.
Contributors to this program include: Maud Ehrmann
This program is provided as open source under the GNU Affero General Public License v3 or later.