ShabbyPages 2022

ShabbyPages is a corpus of born-digital document images, with both ground truth and distorted versions, appropriate for training models to reverse distortions and recover the original denoised documents. This state-of-the-art dataset of synthetically generated, real-world-style document images can be used to improve document layout detection, text extraction, and OCR processes that depend on denoising and binarization preprocessing models.

Training data is often not accompanied by clean ground-truth sources, which leads to inaccurate training and severely limits the volume of usable training data. This dataset was created with Augraphy to produce a synthetic yet realistic dataset based on ground-truth documents.

This repository contains the following scripts for producing the dataset:

  1. letterfit.py, which defines a class that fits images onto an 8.5"x11" Letter page, similar to a document scanner.
  2. shabbypipeline.py, which contains a parametrized default Augraphy pipeline.
  3. daily_pipeline.py, similar to shabbypipeline.py but with modifications for the ShabbyPages set.
  4. generate_kaggle_set.py, which produces the full dataset for the Kaggle competition.
  5. remove_blank_pages.py, which removes images with >99% white pixels from the competition set (see the sketch after this list).
  6. make_submission.py, which produces the submission file for the Kaggle competition.
  7. daily_build.py, which produces a small test set every day.
  8. tweet.py, which tweets an example image from the daily build.
  9. azure_file_service.py, which manages connections to Azure Files.
  10. example_shabby_pipeline_generation.ipynb, an example notebook that generates shabby images from PDF input using the Augraphy and ShabbyPages pipelines.
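
The blank-page filter amounts to a simple white-pixel ratio test. Below is a minimal sketch of that check; the function name, grayscale cutoff of 250, and paths are illustrative assumptions, not the repository's actual code.

```python
# Hypothetical sketch of the >99%-white-pixels check used to drop blank pages.
# The 250 grayscale cutoff for "white" is an assumption, not the repo's value.
import cv2
import numpy as np

def is_blank(path: str, white_cutoff: int = 250, ratio: float = 0.99) -> bool:
    """Return True if more than `ratio` of the pixels are near-white."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    white_fraction = np.mean(gray >= white_cutoff)  # fraction of near-white pixels
    return white_fraction > ratio
```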

Distortion Pipeline

An Augraphy pipeline was applied to ground truth documents to generate printed, scanned, copied and faxed versions of documents encountered in the real world. In order to preserve a pixel-level mapping between ground truth and distorted versions of documents, geometric transformations that skew or warp document images were avoided.

(Pipeline details or visual to be added.)
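As a rough illustration, the sketch below wires up a three-phase Augraphy pipeline of the kind described above, using only non-geometric augmentations so the pixel-level mapping to the ground truth is preserved. The specific augmentations chosen here are assumptions for illustration; the repository's actual phases live in shabbypipeline.py.

```python
# Minimal illustrative Augraphy pipeline (not the repo's shabbypipeline.py).
# All three augmentations are non-geometric, so clean and shabby pixels stay
# aligned. The dict-style augment() API follows recent Augraphy releases.
import cv2
from augraphy import AugraphyPipeline, InkBleed, DirtyDrum, BadPhotoCopy

ink_phase = [InkBleed()]        # ink-level noise (bleeding edges)
paper_phase = [DirtyDrum()]     # paper/printer artifacts (drum streaks)
post_phase = [BadPhotoCopy()]   # copier- and fax-style degradation

pipeline = AugraphyPipeline(ink_phase, paper_phase, post_phase)

clean = cv2.imread("ground_truth/page_0001.png")  # hypothetical path
shabby = pipeline.augment(clean)["output"]        # distorted counterpart
cv2.imwrite("shabby/page_0001.png", shabby)
```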

Credits / Prior Art

Below are related datasets that offer either real-world scanned documents or a combination of ground-truth and distorted versions.

Real-World Datasets

Synthetic Datasets

  • NoisyOffice, a dataset hosted by the University of California, Irvine Machine Learning Repository, contains noisy grayscale printed-text images and their corresponding ground truth, for both real and simulated documents, with four types of noise: folded sheets, wrinkled sheets, coffee stains, and footprints. Each font is paired with each noise type, for 72 simulated images in total. https://archive.ics.uci.edu/ml/datasets/NoisyOffice

  • DDI-100 (Distorted Document Images), a synthetic dataset by Ilia Zharikov et al., is based on 7,000 unique real document pages and consists of more than 100,000 augmented images. Ground truth comprises text and stamp masks, text and character bounding boxes, and relevant annotations. https://arxiv.org/abs/1912.11658

  • NIST-SFRS (Structured Forms Reference Set) consists of 5,590 pages of binary, black-and-white images of synthesized documents from 12 different tax forms from the IRS 1040 Package X for the year 1988. These include Forms 1040, 2106, 2441, 4562, and 6251 together with Schedules A, B, C, D, E, F, and SE. https://www.nist.gov/srd/nist-special-database-2

The Augraphy Project

The synthetic distortions in this dataset were generated by The Augraphy Project using a custom Augraphy pipeline to create realistic old and noisy documents from "born digital" sources. This simulation of realistic paper-oriented process distortions creates large amounts of training data for AI/ML processes to learn how to remove those distortions.

Augraphy is a Python library that creates multiple copies of original documents through an augmentation pipeline that randomly distorts each copy, degrading the clean version into dirty yet realistic copies rendered through synthetic printing, faxing, scanning, and photocopying processes.
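
Because each pass through the pipeline re-samples random parameters, one clean source yields many distinct distorted copies. A minimal sketch using Augraphy's documented default_augraphy_pipeline helper follows; the paths are illustrative.

```python
# One clean page, several randomized shabby copies.
# Each augment() call re-samples the pipeline's random parameters.
import cv2
from augraphy import default_augraphy_pipeline

pipeline = default_augraphy_pipeline()
clean = cv2.imread("ground_truth/page_0001.png")  # hypothetical path

for i in range(5):
    shabby = pipeline.augment(clean)["output"]
    cv2.imwrite(f"copies/page_0001_copy{i}.png", shabby)
```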
