
[RFC0020]: Layout annotating tool #12

Open
4 of 7 tasks
ta4tsering opened this issue Jan 23, 2023 · 32 comments

ta4tsering commented Jan 23, 2023

Table of Contents

Housekeeping

[RFC0020]: Layout annotating tool

ALL BELOW FIELDS ARE REQUIRED

Named Concepts

Prodigy: annotation tool
layout analysis model: a model that detects different layout regions or components in an image
OCR: Optical Character Recognition

Summary

Set up an instance for annotating the layout of BDRC images using prodi.gy.

Reference-Level Explanation

  • In order to get diverse images for annotating:

    • Pick 10 collections (coherent set of images of a similar style)
    • Download images and prepare thumbnails
    • Annotator will select ~100 interesting samples
    • Selected images will be prepared to load into Prodigy
  • We should launch a new instance for layout analysis, which means:

    • a new systemd service
    • a new configuration file
    • a new recipe
    • a new nginx configuration on the server + a new ssl certificate
  • Annotator will annotate the images

  • ML Engineer will train one model per collection and one model with all data combined

  • ML Engineer sends images for review --> streamed to Prodigy

  • Re-train, test, etc. until the model performs very well

  • Move on to next collections
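The selection-and-loading steps above can be sketched as a small script that turns the annotator's chosen images into a Prodigy-ready JSONL stream. This is a minimal sketch, not the actual implementation: the directory layout and the `meta` fields are assumptions; only the `{"image": ...}` task shape follows Prodigy's standard input format for image recipes.

```python
import json
from pathlib import Path

def build_manifest(image_dir: str, out_path: str) -> int:
    """Write one Prodigy image task per selected image.

    Assumes the annotator has copied the ~100 chosen images into
    `image_dir`; each JSONL line uses Prodigy's {"image": ...} task
    shape, with the collection name kept in "meta" (an assumption).
    """
    src = Path(image_dir)
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for img in sorted(src.glob("*.jpg")):
            task = {
                "image": img.as_posix(),
                "meta": {"collection": src.name, "file": img.name},
            }
            out.write(json.dumps(task) + "\n")
            count += 1
    return count
```

The resulting file can then be passed to the layout recipe as its input source.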

Our annotation UI will look like this:

[screenshot: layout annotation recipe UI]

Alternatives

Confirm that alternative approaches have been evaluated and explain those alternatives briefly.

Rationale

  • Why was the currently proposed design selected over the alternatives?
  • What would be the impact of going with one of the alternative approaches?
  • Is the evaluation tentative, or is it recommended to spend more time evaluating different approaches?

Drawbacks

Describe any particular caveats and drawbacks that may arise from fulfilling this particular request.

Useful References

Describe useful parallels and learnings from other requests, or work in previous projects.

  • What similar work have we already successfully completed?: We have made an instance to crop BDRC images.
  • Is this something that has already been built by others?
  • What other related learnings do we have?
  • Is there useful academic literature, or are there other articles related to this topic? (provide links)
  • Have we built a relevant prototype previously?
  • Do we have a rough mock for the UI/UX?
  • Do we have a schematic for the system?

Unresolved Questions

  • What is there that is unresolved (and will be resolved as part of fulfilling this request)?
  • Are there other requests with the same or similar problems to solve?

Parts of the System Affected

  • Which parts of the current system are affected by this request?
  • What other open requests are closely related with this request?
  • Does this request depend on fulfillment of any other request?
  • Does any other request depend on the fulfillment of this request?

Future possibilities

How do you see the particular system, or the part of the system affected by this request, being altered or extended in the future?

Infrastructure

Describe the new infrastructure or changes in current infrastructure required to fulfill this request.

Testing

Describe the kind of testing procedures that are needed as part of fulfilling this request.

Documentation

Describe the level of documentation fulfilling this request involves. Consider both end-user documentation and developer documentation.

Version History

v.0.2

Recordings

Meeting minutes (@eric86y, @eroux, @ngawangtrinley, @ta4tsering, @kaldan007)

  • Experiment with Google Colab for batches of around 1k
  • 1 GPU for 100k images takes 24h
  • Google Colab Pro with 500 credits is good if we train twice a month
  • 4 GPUs for OCR would be comfortable
  • Transfer within AWS is free; we need to investigate whether running it somewhere else is cheaper

Work Phases

We should launch a new instance for layout analysis, which means:

  • a new systemd service
  • a new configuration file
  • a new recipe
  • a new nginx configuration on the server + a new ssl certificate
    • estimated time: 2 hours
    • time taken:
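The systemd piece of the list above could look like the following unit file. This is a sketch under stated assumptions: the unit name, paths, label set, and recipe arguments are all hypothetical (`image.manual` is Prodigy's built-in manual image-annotation recipe, but the actual deployment may use a custom recipe).

```ini
# /etc/systemd/system/prodigy-layout.service — hypothetical unit file.
[Unit]
Description=Prodigy layout analysis instance
After=network.target

[Service]
User=prodigy
# Point Prodigy at the new per-instance configuration file (path assumed).
Environment=PRODIGY_CONFIG=/etc/prodigy/layout_analysis.json
# Recipe, dataset name, input stream, and labels are placeholders.
ExecStart=/usr/local/bin/prodigy image.manual layout_analysis /data/layout_tasks.jsonl --label page,margin,text_area
Restart=on-failure

[Install]
WantedBy=multi-user.target
```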

Non-Coding

Keep the original naming and structure, and keep this as the first section of the Work Phases section.

  • Planning
  • Documentation
  • Testing

Implementation

A list of checkboxes, one per PR. Each PR should have a descriptive name that clearly illustrates what the work phase is about.

@ta4tsering ta4tsering transferred this issue from OpenPecha/Requests Jan 23, 2023
@ta4tsering ta4tsering self-assigned this Jan 23, 2023
@kaldan007

@eroux regarding the order, @ngawangtrinley is suggesting to collect from all the works in BDRC


kaldan007 commented Jan 23, 2023

@ta4tsering we need to filter only Tibetan works. @eroux, is it possible to get that from the TTL?


eroux commented Jan 23, 2023

I think a better system would be to have a good balance between:

  • manuscripts in Uchen
  • manuscripts in Ume
  • manuscripts from Dunhuang
  • Tibetan prints
  • Chinese prints
  • Mongolian prints
  • Buryat prints
  • Khmer manuscripts (both short and long palm leaves)
  • Burmese manuscripts
  • modern prints in book format
  • modern prints in pecha format
  • manuscripts from Nepal

perhaps starting with 500 of each. If / when it finishes, then we can start thinking of the next steps. That's the kind of proposal I was expecting...

@kaldan007

@eroux Can I know why we need Burmese?


eroux commented Jan 23, 2023

because we want to do the exact same thing (layout detection, OCR, OCR cleanup, etc.) for all languages. It's a comparatively minor cost with a lot of potential benefits


ta4tsering commented Jan 23, 2023

Can you suggest a way to classify the works into the above-mentioned types of prints or manuscripts? Can we get that from the TTL file of the work? For example, for modern prints I can use bdo:printMethod bdr:PrintMethod_Modern from the TTL.
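As a rough illustration of the TTL-based classification idea, here is a naive sketch that scans a work's Turtle text for the bdo:printMethod triple mentioned above. The helper name is hypothetical, and a real implementation should use an RDF library (e.g. rdflib) rather than a regex, since Turtle syntax varies (prefixes, multi-line triples, etc.).

```python
import re
from typing import Optional

# Matches triples like: bdo:printMethod  bdr:PrintMethod_Modern
# Naive sketch only; real TTL should be parsed with an RDF library.
PRINT_METHOD_RE = re.compile(r"bdo:printMethod\s+bdr:(\w+)")

def classify_print_method(ttl_text: str) -> Optional[str]:
    """Return e.g. 'PrintMethod_Modern' if the work's TTL declares a
    print method, or None if no such triple is found."""
    m = PRINT_METHOD_RE.search(ttl_text)
    return m.group(1) if m else None
```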


eroux commented Jan 23, 2023

basically you have to understand that the end goal is not annotations for annotations' sake. The goal is to have a dataset that we can use to train a model that will do some layout detection. Producing the best dataset for such a model should be the number 1 requirement. Now, there are several ways of creating such a dataset, and I feel that right now we haven't even touched the question of what we wanted. I think it would be reasonable to have in fact 3 datasets:

  • himalayan pecha that could contain:
    • block prints (Tibetan, Chinese, Mongolian, Buryat)
    • manuscripts (Tibetan, Mongolian, Nepalese)
    • modern editions in pecha format, ideally from different regions
  • modern books
  • palm leaves manuscripts (that could be trained on Nepalese, Khmer and Burmese palm leaves manuscripts)

I can work on something like that, it's not particularly straightforward but it can be ready in a few days


eroux commented Jan 23, 2023

constituting this dataset is part of the work, and people in different domains should be consulted:

  • AI experts
  • persons who know the BDRC database
  • persons who will take general policy decisions for the project

the work consists in:

  • organizing communication between these stakeholders
  • producing a document recording the policy decisions
  • producing scripts to get the right data
  • using the right data in the recipe

@ngawangtrinley

For layout detection I guess we can do everything regardless of the language. For OCR we will have to limit ourselves to Tibetan since that's what the grant is for.


eroux commented Jan 23, 2023

sure, OCR for non-Tibetan script is out of the scope


eroux commented Jan 23, 2023

(BTW, the estimations of the rest of the work are totally off I think; it can take just 2h total, the juicy part is the dataset)

@ngawangtrinley

@eroux this RFC is not finalized! It is in the process of "request for comments" and we are consulting you and other specialists before finalizing the work plan and starting to code. Please don't hesitate to give feedback and opinions, and we'll integrate them into our plan.

@kaldan007 kaldan007 self-assigned this Jan 23, 2023

eroux commented Jan 24, 2023

as requested, here are a few collections that I think could be good:

  • W26071 (Zhol)
  • W3CN20612 (Dege)
  • W1PD96685 (Cone Kangyur)
  • W1KG26108 (Chinese print)
  • W14322 (Chinese print, different layout)
  • W29468 (Mongolian print)
  • W1KG89102 (Buryat print)
  • W1PD100944 (modern pecha print)
  • W1KG14700 (clear manuscript)
  • https://library.bdrc.io/search?r=bdr:PR1TIBET00&t=Scan&s=title%20forced (old Bodong prints, the NGMPP title cards should be removed)

that's all I can think of right now, but don't hesitate to add to it!

@kaldan007

@eroux thank you for the list

@kaldan007

@eroux is it a good idea to zip the images, save them in a GitHub directory, and share the zip file link with the annotator?


eroux commented Jan 24, 2023

this will be too big for GitHub; I think OpenPecha needs a way to provide files online. S3 is a good solution for that: you can upload a zip to S3 and send a URL with a token that's valid for a week or so. This works well at scale.

@kaldan007

ok


eroux commented Jan 24, 2023

Also, unrelated to the thumbnails issue, we need to determine the URL scheme of the various instances of Prodigy. Currently the instance for cropping is on https://prodigy.bdrc.io, but where would we put the instance for layout? Perhaps having a schema like prodigy.bdrc.io/{recipe_name}/ (so https://prodigy.bdrc.io/bdrc_crop/ and https://prodigy.bdrc.io/layout_analysis/) would be good?
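A sketch of that path-based scheme in nginx. The upstream ports and addresses are assumptions (8080 for the existing crop instance, 8090 for layout analysis), and Prodigy may additionally need its base path configured so the web app's assets resolve under a sub-path:

```nginx
# Sketch only: one server, one certificate, path-based routing to each
# Prodigy instance. Upstream ports are assumptions.
server {
    listen 443 ssl;
    server_name prodigy.bdrc.io;

    location /bdrc_crop/ {
        proxy_pass http://127.0.0.1:8080/;
    }

    location /layout_analysis/ {
        proxy_pass http://127.0.0.1:8090/;
    }
}
```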

@kaldan007

@eroux I agree with the naming. @ta4tsering what do you think?

@ta4tsering

Yeah, the naming sounds good. I have just changed the port in the nginx configuration file layout_analysis.conf and the Prodigy configuration file layout_analysis.json to port 8090, restarted the server to test, and it works fine. Now I will look into the URL schema.


eric86y commented Jan 25, 2023

Have you decided upon a layout scheme for annotating images, i.e. which elements you will annotate, etc.? Moreover, can Prodigy handle multi-class annotation/tagging, so that you can also add a script tag to an annotated line?


eroux commented Jan 25, 2023

@eric86y see #10 for the current schema. @kaldan007 perhaps this could be integrated in the RFC?


eric86y commented Jan 25, 2023

No lines?

@kaldan007

@eric86y are we supposed to include lines? We were thinking of it as separate. Can it be together? If yes, we will update the UI accordingly.


eric86y commented Jan 25, 2023

I think it depends on how the annotation pipeline is organized; if you split this, then it is ok. But for doing OCR that is based on lines, you'll need robust line detection as well. This can theoretically be done by another team. I was just considering tagging the script type in the process, to train a script classifier down the road.


eroux commented Jan 25, 2023

I think the plan is to have different steps:

  • one pass to detect pages only
  • one pass to detect layout features only
  • one pass to detect lines of the main text area

this would be done in 3 different prodigy instances

@kaldan007

> I think the plan is to have different steps:
>
>   • one pass to detect pages only
>   • one pass to detect layout features only
>   • one pass to detect lines of the main text area
>
> this would be done in 3 different prodigy instances

yes we are planning to go this way


eroux commented Jan 25, 2023

@kaldan007 just as a general remark: the RFC is 40% filler (everything in italic), 60% information. Perhaps we could either transform the filler into information or remove it? It will make this RFC thing much more appealing (it currently feels like a normal GitHub issue + some unnecessary copy/paste of a template to make it look professional).

My main point though is the following: please create the zip files for the collection on an AWS EC2 instance in the us-east-1 zone (like the prodigy server) so that we can minimize the transfer cost. Thanks!

@kaldan007

@eroux I hundred percent agree with you. @ngawangtrinley we definitely need to simplify a bit. I am getting complaints from our team also.

Regarding the zip image collections, @ngawangtrinley says he has images of those collections on a hard drive. So rather than downloading, we thought to make the thumbnails from his system. Each collection will have one repo with an empty folder named unique_images, where annotators are supposed to copy the unique images. The thumbnail zip file will be added to the release of that repo, and the downloadable link will be added to the readme of that repo with instructions for selecting images.


eroux commented Jan 25, 2023

well, what we need is an automated pipeline that we can trigger. Relying on NT having some images on his hard drive may work for the first 10 collections, but it doesn't look like this can be used in an automatic workflow... that's just my 2c

@kaldan007

Shall we do the mentioned procedure for the first 10 collections? While our annotators are occupied, we can assign the fully automated workflow to a developer. What do you think? @eroux


eroux commented Jan 25, 2023

sure!
