input MMIF spec for RFB app #2
@keighrim -- My approach for this was actually to work backwards. For context, this is the function I used to set up the annotations (it pulls out some extra info that the app won't need, like the confidence score). But I'm not sure how this would work inside the app. I hard-coded the view ids here, but it should be dynamic.

```python
import json
import pathlib
from typing import List, Tuple

from mmif import Mmif, AnnotationTypes
from tqdm import tqdm


def gather_ocr_data(data_dir: str) -> List[Tuple[str, int, str, float, str]]:
    """
    Takes a directory of MMIF files with views from SWT and DocTR.
    Iterates over each TextDocument in the DocTR view, and obtains the
    corresponding SWT label via Alignments.
    Returns a list of tuples, where each tuple contains the guid, timepoint,
    scene, confidence, and OCR text.

    :param data_dir: directory containing mmif files
    :return: list of tuples in the form [(guid, timepoint, scene, confidence, ocr text), ...]
    """
    path = pathlib.Path(data_dir)
    outputs = []
    for filename in tqdm(list(path.glob('*.mmif'))):
        with open(filename, 'r') as f:
            curr_mmif = json.load(f)
        curr_mmif = Mmif(curr_mmif)
        guid = filename.stem.split('.')[0]
        # grab the necessary views
        # in this batch, swt is in 'v_0', doctr chyrons are in 'v_2', doctr credits are in 'v_3'
        # TODO: Figure out an app-agnostic way of doing this?
        swt_view = curr_mmif.get_view_by_id('v_0')
        doctr_view = curr_mmif.get_view_by_id('v_3')
        timeframes = swt_view.get_annotations(at_type=AnnotationTypes.TimeFrame)
        # map tp representative to TimeFrame annotation
        timepoints2frames = {tp_rep: tf for tf in timeframes for tp_rep in tf.get('representatives')}
        # map tp id to timePoint value
        timepoints = list(swt_view.get_annotations(at_type=AnnotationTypes.TimePoint))
        timepoints = {tp.get('id'): tp.get('timePoint') for tp in timepoints}
        for textdoc in doctr_view.get_documents():
            ocr_text = rf'{textdoc.text_value}'
            td_id = textdoc.id
            # To get the swt label, we need the alignment between textdocument and timepoint id.
            # Then we use that timepoint id to get the timeframe it represents.
            # From the timeframe, get the label and confidence.
            td_alignment = list(doctr_view.get_annotations(AnnotationTypes.Alignment, target=td_id))
            timepoint = td_alignment[0].get('source')  # e.g. "v_0:tp_54"
            tp_id = timepoint.split(':')[1]
            timepoint = timepoints[tp_id]
            scene_label = timepoints2frames[tp_id].get('label')
            confidence = timepoints2frames[tp_id].get('classification')[scene_label]
            outputs.append((guid, timepoint, scene_label, confidence, ocr_text))
            # Ex: (cpb-aacip-526-dj58c9s78v, 1187187, chyron, 0.5742753098408380, Glen Miller)
    return outputs
```
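As an aside on the `timepoints2frames` comprehension above: it inverts the one-to-many TimeFrame-to-representatives relation into a flat lookup keyed by timepoint id. A minimal sketch of the same pattern using plain dict stand-ins (hypothetical data, no MMIF dependency):

```python
# Hypothetical stand-ins for TimeFrame annotations: each frame lists the
# timepoint ids that "represent" it, mirroring SWT's `representatives` property.
timeframes = [
    {"id": "tf_1", "label": "credits", "representatives": ["tp_3", "tp_7"]},
    {"id": "tf_2", "label": "chyron", "representatives": ["tp_12"]},
]

# Invert the one-to-many relation: every representative timepoint id
# maps back to the TimeFrame annotation that contains it.
timepoints2frames = {tp_rep: tf for tf in timeframes for tp_rep in tf["representatives"]}

print(timepoints2frames["tp_7"]["label"])   # credits
print(timepoints2frames["tp_12"]["label"])  # chyron
```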
Documenting some updates on this. As discussed in a previous meeting, @haydenmccormick and I decided to go with that approach. For the output, as we decided earlier, RFB will generate TextDocument annotations, each containing a raw CSV string. Additionally, we decided to have it generate secondary alignments, mapping the source alignment within the OCR view to the target RFB TextDocument.
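The raw-CSV-in-a-TextDocument output could be produced with the standard `csv` module; a minimal sketch, where the `rows_to_csv_text` helper and the role/filler column layout are illustrative assumptions, not the app's actual schema:

```python
import csv
import io

def rows_to_csv_text(rows):
    """Serialize (role, filler) pairs into the raw CSV string that would
    become the text value of an RFB TextDocument (hypothetical layout)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["role", "filler"])  # illustrative header row
    writer.writerows(rows)
    return buf.getvalue()

text = rows_to_csv_text([("Director", "Glen Miller"), ("Producer", "Jane Doe")])
print(text)
```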
@keighrim -- We briefly mentioned the eventual inclusion of a runtime parameter that allows the user to define their own labelmap for the TimePoints. One important thing to note is that the current RFB model/parser will be somewhat inflexible here: we made the assumption that the SWT scene label for a credits frame would always be the same fixed label. If a user of the swt-ocr-rfb pipeline wants to use a different custom labelset, then there's more potential for the NER model to give unexpected results, and the parser would definitely fail. It's not an immediate concern for us, but in the future these components would need to be reworked if that extra flexibility is needed.
ATM I can think of a very simple trick to make the app a little bit flexible. For example, the user can call the app with parameters (such as the `label` and `mode` parameters used below). And in the app code:

```python
def _annotate(self, mmif, **parameters):
    # do stuff
    # then retrieve relevant annotations, with the new alignment caching
    for view in mmif.get_all_views_contain(AnnotationTypes.TimePoint):
        for tp_ann in view.get_annotations(AnnotationTypes.TimePoint):
            if tp_ann.get_property('label') in parameters['label']:
                for aligned in tp_ann.get_all_aligned():
                    if aligned.at_type == DocumentTypes.TextDocument:
                        lm_input = prepare_bert_input(parameters['mode'], aligned.text_value)
                        # do more stuff with LM
                        ...
```

This is obviously just a stopgap, and you're right in that we need some rework in the future to achieve real flexibility. But the basic direction here successfully decouples the label set used for finetuning BERT (from TF's SWT v4 era) and the label set used in TP classification in SWT.
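That decoupling idea can be illustrated with a plain labelmap lookup. In this sketch, `DEFAULT_LABELMAP` and `postbin` are hypothetical names (not the app's API); the point is that the user-facing post-binned labels stay fixed while the raw classifier labels can vary:

```python
# Hypothetical labelmap: raw single-letter SWT labels -> post-binned labels.
# A user running a custom labelset would supply their own mapping.
DEFAULT_LABELMAP = {"C": "credits", "S": "slate", "Y": "chyron"}

def postbin(raw_label, labelmap=None):
    """Translate a raw TimePoint label into the post-binned label the
    downstream model was trained on; unknown labels fall through as-is."""
    labelmap = labelmap or DEFAULT_LABELMAP
    return labelmap.get(raw_label, raw_label)

# The app-side check then compares post-binned labels only, so the
# finetuning label set stays decoupled from the classifier's raw labels.
wanted = {"credits", "chyron"}
raw_labels = ["C", "S", "Y", "B"]
selected = [l for l in raw_labels if postbin(l) in wanted]
print(selected)  # ['C', 'Y']
```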
One more possible problem with taking that approach: this is a relatively minor issue, and we can deal with it by cross-filtering in a post process. But I think something like this can be a bit safer:

```python
def _annotate(self, mmif, **parameters):
    # do stuff
    # then retrieve relevant annotations, with the new alignment caching
    for view in mmif.get_all_views_contain(AnnotationTypes.TimeFrame):
        for tf_ann in view.get_annotations(AnnotationTypes.TimeFrame):
            # note that here we use "postbinned" label names, not the single-letter raw labels
            if tf_ann.get_property('label') in parameters['label']:
                for tp_ann in [mmif[rep_id] for rep_id in tf_ann.get_property('representatives')]:
                    for aligned in tp_ann.get_all_aligned():
                        if aligned.at_type == DocumentTypes.TextDocument:
                            lm_input = prepare_bert_input(parameters['mode'], aligned.text_value)
                            # do more stuff with LM
                            ...
```
… `credit` label back to `credits`
- clamsproject/mmif#188 (comment)
- clamsproject/app-role-filler-binder#2 (comment)
- importing `app.py` from `metadata.py` requires the entire dependencies to be in place. I changed the dependency direction so that `metadata.py` can run with just the `clams-python` dependency
I might have misunderstood, but I thought docTR only selects the …

Yeah, you're right. When an OCR app creates its …

Okay, so supposing I try changing the app logic following your suggestion, would the input to RFB just be … ? Should we then change the … ?
Based on the model input (https://github.com/clamsproject/app-role-filler-binder-old/issues/3), the RFB app would expect two pieces of information.

The current target pipeline that RFB will use as upstream is SWT-docTR. A MMIF output from the pipeline will have two views (swt view, ocr view) with these annotation objects:

- `TimePoint` (swt view): holds the time point and "raw" scene classification label (`C`, `S`, ...)
- `TimeFrame` (swt view): holds the time-wise start/end and representative timepoints, and the "binned" label (`slate`, `credits`, ...). These binned labels are the labels that RFB will use
- `TextDocument` (ocr view): holds text contents
- `Alignment` (ocr view): anchors the `TextDocument`s back to the `TimePoint`s

Given that, what the RFB app should specify as input types are:

- `TimeFrame`, where the scene types are recorded
- `TextDocument`, where the text contents are recorded

Then internally, it looks for `TimeFrame` annotations first, grabs all the `TextDocument`s "aligned" to the frame, and aggregates the necessary information from the two annotation types to perform the inference.

The tricky part here is that there is no explicit `Alignment` annotation between `TimeFrame` and `TextDocument`; instead we have `TimeFrame` with a `representatives` attribute that pseudo-aligns a `TF` and `TP`, and then there are explicit `Alignment` annotations between `TP` and `TD`.

@kelleyl you also mentioned a very relevant problem with the llava captioner app. Do you have anything else to add to the problem description?
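The pseudo-alignment chain (TF → representatives → TP → explicit Alignment → TD) can be resolved with plain lookups. A minimal sketch over dictionary stand-ins (hypothetical data and helper name, no mmif-python involved):

```python
# Hypothetical stand-ins for the annotations in the two views.
timeframes = [{"id": "tf_1", "label": "credits", "representatives": ["tp_5"]}]
alignments = [{"source": "tp_5", "target": "td_9"}]   # ocr view: TP <-> TD
textdocs = {"td_9": "Produced by Jane Doe"}

# Index the explicit TP->TD alignments once.
tp2td = {a["source"]: a["target"] for a in alignments}

def texts_for_frame(tf):
    """Collect the OCR text of every TextDocument reachable from a TimeFrame
    through its representative TimePoints (TF -> reps -> TP -> Alignment -> TD)."""
    return [textdocs[tp2td[tp]] for tp in tf["representatives"] if tp in tp2td]

print(texts_for_frame(timeframes[0]))  # ['Produced by Jane Doe']
```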