RFB as a TR (or OCR) consumer #4

Open
keighrim opened this issue Jun 29, 2024 · 0 comments

This is more of a status-tracking thread than an issue.


RFB, as it stands now, is the primary consumer of text recognition (TR, or optical character recognition, OCR) components. Since there is more than one TR app in the CLAMS appdir, I felt it would be nice to collect the relevant information in one place, especially regarding the I/O relation between TR apps and RFB (and future TR consumers).

CLAMS TR Apps

We also have apps that work similarly but are not conventional TR apps

Input specs of the TR apps

Possible upstream scenarios

  1. start from a blank state (no upstream): the TR app should go through the entire video at a certain sample rate and transcribe all the extracted frame images.
  2. start from point-wise scene type recognition (e.g., SWT): the TR app should pick all the relevant TimePoint annotations (the relevant TP labels should be passed as a runtime parameter) and transcribe them.
  3. start from interval-wise scene type recognition (e.g., SWT + stitcher): the TR app should pick a "representative" frame (or a set of frames) for each TimeFrame annotation (again, the relevant TF labels should be passed as a runtime parameter) and transcribe all the representatives.
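
The three scenarios above can be sketched as three frame-selection functions. This is only an illustration, not how any existing TR app is implemented: the dictionary keys (`frame`, `start`, `end`, `label`), the function names, and the choice of the midpoint as the "representative" frame are all assumptions made for the example.

```python
from typing import Dict, List, Set


def sample_all(total_frames: int, sample_rate: int) -> List[int]:
    """Scenario 1 (no upstream): take every `sample_rate`-th frame."""
    return list(range(0, total_frames, sample_rate))


def pick_timepoints(timepoints: List[Dict], labels: Set[str]) -> List[int]:
    """Scenario 2: keep only TimePoints whose label was requested at runtime."""
    return [tp["frame"] for tp in timepoints if tp["label"] in labels]


def pick_representatives(timeframes: List[Dict], labels: Set[str]) -> List[int]:
    """Scenario 3: one representative frame per relevant TimeFrame.

    The midpoint is a hypothetical choice; a real app may pick differently
    or return several frames per interval.
    """
    return [
        (tf["start"] + tf["end"]) // 2
        for tf in timeframes
        if tf["label"] in labels
    ]
```

In all three cases the result is a plain list of frame numbers, so the transcription step itself does not need to know which upstream scenario produced it.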

Optionally, any of the above pipelines can be extended with a text localization (TL) app (e.g., EAST):

  1. blank + TL
  2. TP + TL
  3. TF + TL

TODO: assess the current situation with TL and decide whether using TL actually makes sense, considering the costs and gains.

Output specs of the TR apps

In general, every TR/OCR app should return, at minimum:

  • TextDocument (td)
  • BoundingBox (bb-top)
  • Alignment (a-top, between td and bb-top)

The "td" annotation should cover the entire text content of a single image, and the "bb-top" annotation should draw an axis-aligned rectangle that covers the entire text region.
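
When the TR engine only reports element-level boxes, one way to obtain the "bb-top" rectangle is to take the union of those boxes. A minimal sketch, assuming boxes are given as `(x_min, y_min, x_max, y_max)` tuples (the tuple layout and the `union_box` name are assumptions for illustration):

```python
from typing import Iterable, Tuple

# Assumed layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[int, int, int, int]


def union_box(boxes: Iterable[Box]) -> Box:
    """Return the smallest axis-aligned rectangle covering all given boxes."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("at least one box is required")
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )
```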

If the text localization (TL) feature of the underlying TR engine can return element-wise bounding boxes (words, lines, blocks, ...), the app should additionally return:

  • Paragraph (for TL blocks), Sentence (for TL lines), Token (for TL words) (as ling-lower annotation)
  • BoundingBox (bb-lower)
  • Alignment (a-lower, between bb-lower and ling-lower annotations)

Note that even without the lower-level bounding boxes, the top-level TextDocument should "render" the line and block information using newline characters. The lower-level bounding box annotations therefore only add coordinate information for those secondary "text" annotations, in the hope that the coordinates are useful for future processing.
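
As an illustration of "rendering" the layout into the top-level text, the sketch below joins words with spaces, lines with a single newline, and blocks with a blank line. The exact separators are an assumption; the text above only says that newline characters carry the line and block information. The nested-list input shape and the `render_text` name are also hypothetical.

```python
from typing import List


def render_text(blocks: List[List[List[str]]]) -> str:
    """Render blocks -> lines -> words into one string.

    Assumed convention: words are joined by a space, lines by one newline,
    and blocks by a blank line (two newlines).
    """
    rendered_blocks = []
    for lines in blocks:
        rendered_lines = [" ".join(words) for words in lines]
        rendered_blocks.append("\n".join(rendered_lines))
    return "\n\n".join(rendered_blocks)
```

A consumer that only needs the text can then read this single string, while a consumer that needs coordinates can still use the lower-level annotations.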

How RFB handles (or should handle) various TR outputs

Assuming that all TR apps output at least the "minimal" output types, RFB (or any other TR consumer, including issues like clamsproject/aapb-evaluations#52) should be able to "grab" the correct view by searching for one that contains the TextDocument, BoundingBox, and Alignment types, since the RFB model (as currently implemented) doesn't care about "visual" features such as the coordinates of lines or words.

However, since the current model relies on the scene type as part of the input string, we still need a way to grab the scene type label by going further down the alignment chains and pulling out the SWT (or similar) TimeFrame annotations. This is where the "input" spec of the TR apps becomes relevant to this problem.
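
The view lookup and the alignment-chain walk described above could look roughly like the sketch below. It deliberately uses plain dictionaries as a stand-in for MMIF rather than the real SDK, so the structure (`views`, `annotations`, `type`, `id`, `source`, `target`, `label`) and both function names are assumptions. Real MMIF files may also link TimeFrames to TimePoints by means other than Alignment annotations, which this sketch does not cover.

```python
from collections import deque
from typing import Dict, List, Optional

# Minimal TR output types a consumer looks for in a single view.
REQUIRED = {"TextDocument", "BoundingBox", "Alignment"}


def find_tr_view(views: List[Dict]) -> Optional[Dict]:
    """Return the first view that contains all minimal TR output types."""
    for view in views:
        types = {ann["type"] for ann in view["annotations"]}
        if REQUIRED <= types:
            return view
    return None


def scene_label(start_id: str, views: List[Dict]) -> Optional[str]:
    """Follow Alignment links breadth-first from `start_id` to a TimeFrame."""
    by_id: Dict[str, Dict] = {}
    links: Dict[str, List[str]] = {}
    for view in views:
        for ann in view["annotations"]:
            by_id[ann["id"]] = ann
            if ann["type"] == "Alignment":
                # Treat alignments as undirected edges.
                links.setdefault(ann["source"], []).append(ann["target"])
                links.setdefault(ann["target"], []).append(ann["source"])
    seen, queue = {start_id}, deque([start_id])
    while queue:
        current = queue.popleft()
        ann = by_id.get(current)
        if ann is not None and ann["type"] == "TimeFrame":
            return ann.get("label")
        for nxt in links.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None
```

How long the chain is depends on the upstream scenario: a TR app that ran on a blank state has no TimeFrame to reach, in which case the walk returns `None`.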

Future plan

As raised in clamsproject/mmif-visualizer#41 and clamsproject/aapb-evaluations#35, we'd like to have a concept of app groups (or app patterns) that define similar (if not identical) I/O patterns for apps that do the same kind of information extraction/transformation.

This issue is meant to start the attempt at defining such patterns with the existing apps.
