OCR0037: Extracting missing Tibetan glyphs & ligatures #1

Open
5 tasks done
10kalden opened this issue Jun 13, 2024 · 2 comments

10kalden commented Jun 13, 2024

Description:

  • Creating a new Tibetan font requires around 1044 essential glyphs. Glyph and ligature extraction from the works has already been done, and around 700 glyphs have been extracted for use in the fonts.
  • Some glyphs are still missing, mainly Tibetan superscripts, subscripts and complex ligatures.
  • These missing glyphs were not obtained when the works were run through Google OCR.
  • Another approach is needed to obtain the missing glyphs.
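
As a rough illustration of how the gap between the required and the extracted glyphs could be tracked, here is a minimal sketch. The function name `missing_glyphs` and the two sample lists are hypothetical; the real inventories (about 1044 required, about 700 extracted) are not shown in this issue.

```python
def missing_glyphs(required, extracted):
    """Return the required glyphs that have not been extracted yet, sorted."""
    return sorted(set(required) - set(extracted))


# Hypothetical sample data: three required glyphs, one already extracted.
required = ["\u0f40", "\u0f66\u0f90", "\u0f66\u0f92\u0fb2"]
extracted = ["\u0f40"]
still_missing = missing_glyphs(required, extracted)
```
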

Implementation plan:

Image

Sub-task:

  • Explore the OPF folders
  • Select all the ligatures with the required superscripts and subscripts to be cropped
  • Upload the subjoined images to S3 and create the JSONL file
  • Explore alternative procedures to extract ligatures that were not caught by Google OCR
  • Write a script to extract the ligatures from the transcribed text

Completion Criteria:

  • All the missing glyphs and ligatures are obtained
@10kalden 10kalden self-assigned this Jun 13, 2024
@10kalden 10kalden changed the title from "OCR0021: Extracting missing Tibetan glyphs & ligatures" to "OCR0030: Extracting missing Tibetan glyphs & ligatures" Jun 13, 2024

10kalden commented Jun 14, 2024

To extract Tibetan subjoined letters, I am writing a script that parses all the Tibetan ligatures we have found and checks each one for subjoined letters. Ligatures that contain a subjoined letter will be uploaded to S3, and a JSONL file with all the metadata will be created to be loaded into Prodigy for annotation.
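
A minimal sketch of the subjoined-letter check and the JSONL step, not the actual script. It assumes the ligatures are available as a mapping from image file name to ligature text, and that the images will live under some base URL; `subjoined_letters`, `to_jsonl`, the bucket URL and the `meta` fields are all hypothetical. Subjoined consonants sit in the Unicode range U+0F90 to U+0FBC. The `image` key follows the usual shape of Prodigy image tasks, but that should be checked against the recipe being used. The S3 upload itself is left out.

```python
import json

# Unicode range of Tibetan subjoined consonants (U+0F98 is unassigned).
SUBJOINED_START, SUBJOINED_END = 0x0F90, 0x0FBC


def subjoined_letters(ligature: str) -> list[str]:
    """Return the subjoined consonants found in a ligature, in order."""
    return [ch for ch in ligature if SUBJOINED_START <= ord(ch) <= SUBJOINED_END]


def to_jsonl(ligatures: dict[str, str], base_url: str) -> str:
    """Build one JSONL line per ligature that contains a subjoined letter.

    `ligatures` maps an image file name to its ligature text (assumed layout).
    """
    lines = []
    for name, text in ligatures.items():
        found = subjoined_letters(text)
        if not found:
            continue  # skip ligatures without any subjoined letter
        record = {
            "image": f"{base_url}/{name}",
            "text": text,
            "meta": {"subjoined": [f"U+{ord(ch):04X}" for ch in found]},
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Hypothetical sample: one plain letter, one stack with a subjoined KA.
sample = {"ka.png": "\u0f40", "ska.png": "\u0f66\u0f90"}
jsonl = to_jsonl(sample, "https://example-bucket.s3.amazonaws.com/subjoined")
```
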

10kalden commented

To extract the complex ligatures that Google OCR missed, I am using the transcribed text of work ID W2KG209989 (Derge Tengyur) in OPF. The script will parse the text to find each ligature's image number and span, and use those to extract the glyphs.
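
A rough sketch of the span-finding step, under stated assumptions. It treats the transcription as a mapping from image number to page text, which is a simplification and not the actual OPF layout. A "complex ligature" is approximated here as a base consonant followed by at least `min_subjoined` subjoined consonants and any trailing vowel signs or marks; `find_stacks` and the sample page are hypothetical. Mapping a character span to pixel coordinates for cropping is not covered.

```python
import re


def find_stacks(pages: dict[str, str], min_subjoined: int = 2):
    """Return (image_id, start, end, stack) for each complex stack found.

    `pages` maps an image number to its transcribed text (assumed layout).
    Spans are character offsets into that page's text, end exclusive.
    """
    # Base consonant, then subjoined consonants, then vowel signs and marks.
    pattern = re.compile(
        f"[\u0F40-\u0F6C][\u0F90-\u0FBC]{{{min_subjoined},}}[\u0F71-\u0F84]*"
    )
    results = []
    for image_id, text in pages.items():
        for match in pattern.finditer(text):
            results.append((image_id, match.start(), match.end(), match.group()))
    return results


# Hypothetical page: the stack SA + subjoined GA + subjoined RA + vowel U
# sits between plain letters and is followed by a tsheg.
pages = {"img0001": "\u0f56\u0f66\u0f92\u0fb2\u0f74\u0f56\u0f66\u0f0b"}
stacks = find_stacks(pages)
```

Raising `min_subjoined` narrows the search to deeper stacks, while setting it to 1 would also return every simple two-letter stack.
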

@10kalden 10kalden changed the title from "OCR0030: Extracting missing Tibetan glyphs & ligatures" to "OCR0037: Extracting missing Tibetan glyphs & ligatures" Jul 11, 2024
Projects
Status: DONE