OCR0037: Extracting missing Tibetan glyphs & ligatures #1

10kalden · 2024-06-13T06:01:25Z

Description:

To create a new Tibetan font, around 1044 essential glyphs need to be added to the font. Glyph and ligature extraction from the works has already been done, and around 700 glyphs have been extracted for use in the fonts.
The glyphs still missing are mainly Tibetan superscripts, subscripts, and complex ligatures.
They were not obtained when the works were run through Google OCR.
To obtain the missing glyphs, another approach has to be taken.

Implementation plan:

Sub-task:

Explore the OPF folders
Select all the ligatures with the required superscripts and subscripts to be cropped
Upload the subjoined images to S3 and create a JSONL file
Explore alternative procedures to extract ligatures that were not caught by Google OCR
Write a script to extract the ligatures from the transcribed text

Completion Criteria:

All the missing glyphs and ligatures have been obtained.

10kalden · 2024-06-14T06:26:55Z

To extract Tibetan subjoined letters, I am writing a script that parses all the Tibetan ligatures we have found and checks each one for subjoined letters. Each ligature found to contain a subjoined letter will be uploaded to S3, and a JSONL file with all the metadata will be created to be loaded into Prodigy for annotation.
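
A minimal sketch of the check described above, not the actual script. It relies on the fact that Tibetan subjoined consonants occupy U+0F90 to U+0FBC in Unicode. Everything else is an assumption for illustration: the function names, the `url_prefix` argument, the file naming by code point, and the `image`/`text`/`meta` record fields.

```python
import json

# Tibetan subjoined consonants occupy U+0F90..U+0FBC in Unicode.
SUBJOINED_START, SUBJOINED_END = 0x0F90, 0x0FBC


def subjoined_letters(ligature: str) -> list[str]:
    """Return the subjoined letters contained in a ligature, in order."""
    return [ch for ch in ligature if SUBJOINED_START <= ord(ch) <= SUBJOINED_END]


def to_jsonl(ligatures: list[str], url_prefix: str) -> str:
    """Build one JSONL line per ligature that contains a subjoined letter.

    url_prefix and the record layout are hypothetical; adjust them to the
    real S3 bucket layout and annotation recipe.
    """
    lines = []
    for lig in ligatures:
        subs = subjoined_letters(lig)
        if not subs:
            continue  # skip ligatures without any subjoined letter
        # Name the image after the ligature's code points (assumed convention).
        codepoints = "_".join(f"{ord(c):04X}" for c in lig)
        record = {
            "image": f"{url_prefix}/{codepoints}.png",
            "text": lig,
            "meta": {"subjoined": subs},
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)
```

For example, `subjoined_letters("\u0F66\u0F90")` (the stack སྐ) returns the single subjoined letter U+0F90, while a bare consonant such as `"\u0F40"` returns an empty list and is skipped by `to_jsonl`.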

10kalden · 2024-06-18T05:54:45Z

To extract the complex ligatures that Google OCR missed, I am using the transcribed text of work ID W2KG209989 (Derge Tengyur) in OPF. The script will parse the text to find each ligature's image number and span and use them to extract the glyphs.
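
A rough sketch of how such a lookup could work, under stated assumptions. It does not use the OPF tooling: the `pages` argument, a sorted list of `(start_offset, image_number)` pairs, is a hypothetical stand-in for whatever page annotations the OPF provides. Treating "complex" as a base consonant followed by two or more subjoined letters is also an assumption made for this example.

```python
import bisect
import re

# Assumed definition of a "complex" ligature: a base consonant
# (U+0F40..U+0F6C) followed by two or more subjoined letters (U+0F90..U+0FBC).
STACK_RE = re.compile(r"[\u0F40-\u0F6C][\u0F90-\u0FBC]{2,}")


def find_complex_stacks(text: str, pages: list[tuple[int, int]]) -> list[dict]:
    """Locate complex stacks and the page image each one falls on.

    pages is a hypothetical list of (start_offset, image_number) pairs,
    sorted by start_offset, marking where each page begins in the text.
    """
    starts = [start for start, _ in pages]
    hits = []
    for match in STACK_RE.finditer(text):
        # Find the last page whose start offset is <= the match position.
        idx = bisect.bisect_right(starts, match.start()) - 1
        if idx < 0:
            continue  # match lies before the first annotated page
        hits.append(
            {
                "ligature": match.group(),
                "image_number": pages[idx][1],
                "span": (match.start(), match.end()),  # character offsets in text
            }
        )
    return hits
```

The returned image number and span identify which page image to open and which stretch of transcribed text the glyph belongs to. Mapping that span to pixel coordinates for cropping is a separate step not shown here.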

10kalden self-assigned this Jun 13, 2024

10kalden changed the title ~~OCR0021: Extracting missing Tibetan glyphs & ligatures~~ OCR0030: Extracting missing Tibetan glyphs & ligatures Jun 13, 2024

10kalden changed the title ~~OCR0030: Extracting missing Tibetan glyphs & ligatures~~ OCR0037: Extracting missing Tibetan glyphs & ligatures Jul 11, 2024

