feat/OCR: extract english texts from images #11

Open
wants to merge 2 commits into
base: main

cindyli
Contributor

@cindyli cindyli commented Jun 7, 2023

Description

Add a utility script that extracts English text from images. This script will be part of the process for extracting Bliss data from the archive website.

Steps to test

  1. Open a publication on the archive website. In the "Download options" section, download the "SINGLE PAGE PROCESSED JP2 ZIP" format. Unzip the file locally and place its contents, a set of JP2 images, in a directory.
  2. Follow the instructions in the README to run the utility script.

Expected behavior:

The English text in each image should be extracted and saved as a txt file in the same directory.
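
For reviewers who want a feel for the expected behavior without opening the diff, here is a minimal sketch of that flow. It is not the script in this PR. It assumes Tesseract as the OCR engine (via `pytesseract` and Pillow with JPEG 2000 support), assumes one `.txt` file is written per image, and the names `ocr_with_tesseract` and `extract_texts` are hypothetical.

```python
from pathlib import Path
from typing import Callable, List


def ocr_with_tesseract(image_path: Path) -> str:
    """OCR a single image. Tesseract is an assumed engine, not confirmed by the PR."""
    # Imported lazily so the directory handling below works without these packages.
    from PIL import Image  # Pillow; reading JP2 files requires OpenJPEG support
    import pytesseract

    with Image.open(image_path) as image:
        # lang="eng" restricts recognition to the English language model.
        return pytesseract.image_to_string(image, lang="eng")


def extract_texts(
    directory: str,
    ocr: Callable[[Path], str] = ocr_with_tesseract,
) -> List[Path]:
    """Write a .txt file next to every .jp2 image in `directory`.

    Returns the paths of the text files that were written, in sorted order.
    """
    written = []
    # Note: this pattern only matches the lowercase ".jp2" extension.
    for image_path in sorted(Path(directory).glob("*.jp2")):
        text_path = image_path.with_suffix(".txt")
        text_path.write_text(ocr(image_path), encoding="utf-8")
        written.append(text_path)
    return written
```

Calling `extract_texts("path/to/unzipped/images")` would then leave a text file beside each JP2 image. Passing the OCR step in as a parameter is only a convenience for trying the directory handling without Tesseract installed.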

@cindyli cindyli requested review from agamba and klown June 7, 2023 18:54