
[RFC0020]: Layout annotating tool #12

Open
4 of 7 tasks
ta4tsering opened this issue Jan 23, 2023 · 32 comments

ta4tsering commented Jan 23, 2023

Table of Contents

Housekeeping

[RFC0020]: Layout annotating tool

ALL BELOW FIELDS ARE REQUIRED

Named Concepts

Prodigy: annotation tool
layout analysis model: a model that detects different layout regions or components in an image
OCR: Optical Character Recognition

Summary

Set up an instance for annotating the layout of BDRC images using prodi.gy.

Reference-Level Explanation

  • In order to get diverse images for annotating:

    • Pick 10 collections (coherent set of images of a similar style)
    • Download images and prepare thumbnails
    • Annotator will select ~100 interesting samples
    • Selected images will be prepared to load into Prodigy
  • We should launch a new instance for layout analysis, which means:

    • a new systemd service
    • a new configuration file
    • a new recipe
    • a new nginx configuration on the server + a new ssl certificate
  • Annotator will annotate the images

  • ML Engineer will train one model per collection and one model with all data combined

  • ML Engineer sends images for review --> streamed to Prodigy

  • Re-train, test, etc. until the model performs very well

  • Move on to next collections
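The selection-and-loading steps above can be sketched as a small script that turns the annotator's chosen images into a Prodigy-ready JSONL stream. This is a minimal sketch, not the actual implementation: the directory layout and the `meta` fields are assumptions; only the `{"image": ...}` task shape follows Prodigy's standard input format for image recipes.

```python
import json
from pathlib import Path

def build_manifest(image_dir: str, out_path: str) -> int:
    """Write one Prodigy image task per selected image.

    Assumes the annotator has copied the ~100 chosen images into
    `image_dir`; each JSONL line uses Prodigy's {"image": ...} task
    shape, with the collection name kept in "meta" (an assumption).
    """
    src = Path(image_dir)
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for img in sorted(src.glob("*.jpg")):
            task = {
                "image": img.as_posix(),
                "meta": {"collection": src.name, "file": img.name},
            }
            out.write(json.dumps(task) + "\n")
            count += 1
    return count
```

The resulting file can then be passed to the layout recipe as its input source.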

Our annotation UI will look like this:

[screenshot: layout annotation recipe UI]

Alternatives

Confirm that alternative approaches have been evaluated and explain those alternatives briefly.

Rationale

  • Why was the currently proposed design selected over the alternatives?
  • What would be the impact of going with one of the alternative approaches?
  • Is the evaluation tentative, or is it recommended to spend more time evaluating different approaches?

Drawbacks

Describe any particular caveats and drawbacks that may arise from fulfilling this particular request.

Useful References

Describe useful parallels and learnings from other requests, or work in previous projects.

  • What similar work have we already successfully completed?: We have made an instance to crop BDRC images.
  • Is this something that has already been built by others?
  • What other related learnings do we have?
  • Is there useful academic literature, or are there other articles related to this topic? (provide links)
  • Have we built a relevant prototype previously?
  • Do we have a rough mock for the UI/UX?
  • Do we have a schematic for the system?

Unresolved Questions

  • What is there that is unresolved (and will be resolved as part of fulfilling this request)?
  • Are there other requests with the same or similar problems to solve?

Parts of the System Affected

  • Which parts of the current system are affected by this request?
  • What other open requests are closely related with this request?
  • Does this request depend on fulfillment of any other request?
  • Does any other request depend on the fulfillment of this request?

Future possibilities

How do you see the particular system, or the part of the system affected by this request, being altered or extended in the future?

Infrastructure

Describe the new infrastructure or changes in current infrastructure required to fulfill this request.

Testing

Describe the kind of testing procedures that are needed as part of fulfilling this request.

Documentation

Describe the level of documentation fulfilling this request involves. Consider both end-user documentation and developer documentation.

Version History

v.0.2

Recordings

Meeting minutes (@eric86y, @eroux, @ngawangtrinley, @ta4tsering, @kaldan007)

  • Experiment with Google Colab for batches of around 1k
  • 1 GPU for 100k images takes 24h
  • Google Colab Pro with 500 credits is good if we train twice a month
  • 4 GPUs for OCR would be comfortable
  • Transfer within AWS is free; we need to investigate whether running it somewhere else is cheaper

Work Phases

We should launch a new instance for layout analysis, which means:

  • a new systemd service
  • a new configuration file
  • a new recipe
  • a new nginx configuration on the server + a new ssl certificate
    • estimated time: 2 hours
    • time taken:
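The systemd piece of the list above could look like the following unit file. This is a sketch under stated assumptions: the unit name, paths, label set, and recipe arguments are all hypothetical (`image.manual` is Prodigy's built-in manual image-annotation recipe, but the actual deployment may use a custom recipe).

```ini
# /etc/systemd/system/prodigy-layout.service — hypothetical unit file.
[Unit]
Description=Prodigy layout analysis instance
After=network.target

[Service]
User=prodigy
# Point Prodigy at the new per-instance configuration file (path assumed).
Environment=PRODIGY_CONFIG=/etc/prodigy/layout_analysis.json
# Recipe, dataset name, input stream, and labels are placeholders.
ExecStart=/usr/local/bin/prodigy image.manual layout_analysis /data/layout_tasks.jsonl --label page,margin,text_area
Restart=on-failure

[Install]
WantedBy=multi-user.target
```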

Non-Coding

Keep the original naming and structure, and keep this as the first section of the Work Phases section.

  • Planning
  • Documentation
  • Testing

Implementation

A list of checkboxes, one per PR. Each PR should have a descriptive name that clearly illustrates what the work phase is about.

@ta4tsering ta4tsering transferred this issue from OpenPecha/Requests Jan 23, 2023
@ta4tsering ta4tsering self-assigned this Jan 23, 2023
@kaldan007

@eroux regarding the order, @ngawangtrinley is suggesting to collect from all the works in BDRC


kaldan007 commented Jan 23, 2023

@ta4tsering we need to filter only Tibetan works. @eroux, is it possible to get that from the TTL?


eroux commented Jan 23, 2023

I think a better system would be to have a good balance between:

  • manuscripts in Uchen
  • manuscripts in Ume
  • manuscripts from Dunhuang
  • Tibetan prints
  • Chinese prints
  • Mongolian prints
  • Buryat prints
  • Khmer manuscripts (both short and long palm leaves)
  • Burmese manuscripts
  • modern prints in book format
  • modern prints in pecha format
  • manuscripts from Nepal

perhaps starting with 500 of each. If / when it finishes, then we can start thinking of the next steps. That's the kind of proposal I was expecting...

@kaldan007

@eroux Can I know why we need Burmese?


eroux commented Jan 23, 2023

because we want to do the exact same thing (layout detection, OCR, OCR cleanup, etc.) for all languages. It's a comparatively minor cost with a lot of potential benefits


ta4tsering commented Jan 23, 2023

Can you suggest a way to classify the works into the above-mentioned types of prints or manuscripts? Can we get that from the TTL file of the work? For example, for modern prints I can use bdo:printMethod bdr:PrintMethod_Modern from the TTL.
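As a rough illustration of the TTL-based classification idea, here is a naive sketch that scans a work's Turtle text for the bdo:printMethod triple mentioned above. The helper name is hypothetical, and a real implementation should use an RDF library (e.g. rdflib) rather than a regex, since Turtle syntax varies (prefixes, multi-line triples, etc.).

```python
import re
from typing import Optional

# Matches triples like: bdo:printMethod  bdr:PrintMethod_Modern
# Naive sketch only; real TTL should be parsed with an RDF library.
PRINT_METHOD_RE = re.compile(r"bdo:printMethod\s+bdr:(\w+)")

def classify_print_method(ttl_text: str) -> Optional[str]:
    """Return e.g. 'PrintMethod_Modern' if the work's TTL declares a
    print method, or None if no such triple is found."""
    m = PRINT_METHOD_RE.search(ttl_text)
    return m.group(1) if m else None
```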


eroux commented Jan 23, 2023

basically you have to understand that the end goal is not annotations for annotations' sake. The goal is to have a dataset that we can use to train a model that will do some layout detection. Producing the best dataset for such a model should be the number 1 requirement. Now, there are several ways of creating such a dataset, and I feel that right now we haven't even touched the question of what we wanted. I think it would be reasonable to have in fact 3 datasets:

  • himalayan pecha that could contain:
    • block prints (Tibetan, Chinese, Mongolian, Buryat)
    • manuscripts (Tibetan, Mongolian, Nepalese)
    • modern editions in pecha format, ideally from different regions
  • modern books
  • palm leaves manuscripts (that could be trained on Nepalese, Khmer and Burmese palm leaves manuscripts)

I can work on something like that, it's not particularly straightforward but it can be ready in a few days


eroux commented Jan 23, 2023

constituting this dataset is part of the work, and people in different domains should be consulted:

  • AI experts
  • persons who know the BDRC database
  • persons who will take general policy decisions for the project

the work consists in:

  • organizing communication between these stakeholders
  • producing a document recording the policy decisions
  • producing scripts to get the right data
  • using the right data in the recipe

@ngawangtrinley

For layout detection I guess we can do everything regardless of the language. For OCR we will have to limit ourselves to Tibetan since that's what the grant is for.


eroux commented Jan 23, 2023

sure, OCR for non-Tibetan script is out of the scope


eroux commented Jan 23, 2023

(BTW, the estimations of the rest of the work are totally off I think; it can take just 2h total, the juicy part is the dataset)

@ngawangtrinley

@eroux this RFC is not finalized! It is in the process of "request for comments" and we are consulting you and other specialists before finalizing the work plan and starting to code. Please don't hesitate to give feedback and opinions, and we'll integrate them into our plan.

@kaldan007 kaldan007 self-assigned this Jan 23, 2023

eroux commented Jan 24, 2023

as requested, here are a few collections that I think could be good:

  • W26071 (Zhol)
  • W3CN20612 (Dege)
  • W1PD96685 (Cone Kangyur)
  • W1KG26108 (Chinese print)
  • W14322 (Chinese print, different layout)
  • W29468 (Mongolian print)
  • W1KG89102 (Buryat print)
  • W1PD100944 (modern pecha print)
  • W1KG14700 (clear manuscript)
  • https://library.bdrc.io/search?r=bdr:PR1TIBET00&t=Scan&s=title%20forced (old Bodong prints, the NGMPP title cards should be removed)

that's all I can think of right now, but don't hesitate to add to it!

@kaldan007

@eroux thank you for the list

@kaldan007

@eroux is it a good idea to zip the images, save them in a GitHub directory, and share the zip file link with the annotator?


eroux commented Jan 24, 2023

this will be too big for GitHub; I think OpenPecha needs a way to provide files online. S3 is a good solution for that: you can upload a zip to S3 and send a URL with a token that's valid for a week or so. This works well at scale.

@kaldan007

ok


eroux commented Jan 24, 2023

Also, unrelated to the thumbnails issue, we need to determine the URL scheme of the various instances of Prodigy. Currently the instance for cropping is on https://prodigy.bdrc.io, but where would we put the instance for layout? Perhaps having a schema like prodigy.bdrc.io/{recipe_name}/ (so https://prodigy.bdrc.io/bdrc_crop/ and https://prodigy.bdrc.io/layout_analysis/) would be good?
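A sketch of that path-based scheme in nginx. The upstream ports and addresses are assumptions (8080 for the existing crop instance, 8090 for layout analysis), and Prodigy may additionally need its base path configured so the web app's assets resolve under a sub-path:

```nginx
# Sketch only: one server, one certificate, path-based routing to each
# Prodigy instance. Upstream ports are assumptions.
server {
    listen 443 ssl;
    server_name prodigy.bdrc.io;

    location /bdrc_crop/ {
        proxy_pass http://127.0.0.1:8080/;
    }

    location /layout_analysis/ {
        proxy_pass http://127.0.0.1:8090/;
    }
}
```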

@kaldan007

@eroux I agree with the naming. @ta4tsering what do you think?

@ta4tsering

Yeah, the naming sounds good. I have just changed the port in the nginx configuration file layout_analysis.conf and the Prodigy configuration file layout_analysis.json to port 8090, restarted the server to test, and it works fine. Now I will look into the URL schema.


eric86y commented Jan 25, 2023

Have you decided upon a layout scheme for annotating images, i.e. which elements you will annotate, etc.? Moreover, can Prodigy handle multi-class annotation/tagging, so that you can also add a script tag to an annotated line?


eroux commented Jan 25, 2023

@eric86y see #10 for the current schema. @kaldan007 perhaps this could be integrated in the RFC?


eric86y commented Jan 25, 2023

No lines?

@kaldan007

@eric86y are we supposed to include lines? We were thinking of it as separate. Can it be together? If yes, we will update the UI accordingly.


eric86y commented Jan 25, 2023

I think it depends on how the annotation pipeline is organized; if you split this, then it is ok. But for doing OCR that is based on lines, you'll need robust line detection as well. This can theoretically be done by another team. I was just considering tagging the script type in the process, to train a script classifier down the road.


eroux commented Jan 25, 2023

I think the plan is to have different steps:

  • one pass to detect pages only
  • one pass to detect layout features only
  • one pass to detect lines of the main text area

this would be done in 3 different prodigy instances

@kaldan007

> I think the plan is to have different steps:
>
>   • one pass to detect pages only
>   • one pass to detect layout features only
>   • one pass to detect lines of the main text area
>
> this would be done in 3 different prodigy instances

yes we are planning to go this way


eroux commented Jan 25, 2023

@kaldan007 just as a general remark: the RFC is 40% filler (everything in italic), 60% information. Perhaps we could either transform the filler into information or remove it? It will make this RFC thing much more appealing (it currently feels like a normal GitHub issue + some unnecessary copy/paste of a template to make it look professional).

My main point though is the following: please create the zip files for the collection on an AWS EC2 instance in the us-east-1 zone (like the prodigy server) so that we can minimize the transfer cost. Thanks!

@kaldan007

@eroux I hundred percent agree with you. @ngawangtrinley we definitely need to simplify a bit. I am getting complaints from our team also.

Regarding the zip image collections, @ngawangtrinley says he has images of those collections on a hard drive. So rather than downloading, we thought to make the thumbnails from his system. Each collection will have one repo with an empty folder named unique_images, where annotators are supposed to copy the unique images. The thumbnail zip file will be added to the release of that repo, and the downloadable link will be added to the readme of that repo with instructions for selecting images.


eroux commented Jan 25, 2023

well, what we need is an automated pipeline that we can trigger. Relying on NT having some images on his hard drive may work for the first 10 collections, but it doesn't look like this can be used in an automatic workflow... that's just my 2c

@kaldan007

Shall we do the mentioned procedure for the first 10 collections? While our annotators are occupied, we can assign the fully automated workflow to a developer. What do you think? @eroux


eroux commented Jan 25, 2023

sure!
