MMIF-FIXIT conversion #1

keighrim · 2024-01-12T17:47:56Z

updated based on recent discussions.

New Feature Summary

~~As a part of aapb package~~ As a new MMIF consumer, we'd like to have a data format converter between MMIF and the AAPB JSON files used internally by the FIXIT tool. This should enable data exchange between CLAMS pipelines and FIXIT (a crowd-based ASR correction tool), and let us import FIXIT-ed transcripts into CLAMS pipelines for further processing.

Additional context

Here are some example AAPB json files provided by @owencking.

cpb-aacip-3fde6d4dc0b-transcript.json
cpb-aacip-05a9a67fd3d-transcript.json

and some notes from him:

I think the format is pretty self-explanatory. Just two notes: (1) Notice the speaker_id key. We need to leave that in there, but right now it's not in use. Our convention is to start with 1 and increment the value with each transcript segment. (2) In the future, we will add a provenance key at the top of the JSON file with an object containing a few keys related to the origin of the transcript. We're still figuring out exactly what info to put in there. It will probably be some of the same info that appears at the beginning of a MMIF file.

More examples on google drive shared folder.

Implementation requirements

The conversion needs to support both directions: MMIF to AAPB (m2a, or outbound) and AAPB to MMIF (a2m, or inbound).

m2a conversion

At the top level of the AAPB JSON, a provenance key will be added to hold information about the source (MMIF) and the FIXIT process. At the moment, nothing is settled about how that information will be organized, except that it will be a nested dictionary with a fairly flat structure. Thus, for now, the converter should put a placeholder under that "meta" key.
In AAPB JSON, the top-level language key has mostly held en-US values. In the MMIF spec, we have the @language key in text documents, but we haven't been using regional codes. See the following for how language is specified in the current
- MMIF spec
- and mmif-python SDK
For elements in the parts field of the AAPB JSON, GBH seems to have been using a fixed duration (5 sec) to generate the "segments" of transcript text that populate the parts list. When the Brandeis team worked on MMIF > VTT conversion in MMIF-viz, we used a fixed number of tokens (8 tokens) to generate VTT segments. However, as the primary ASR app in the CLAMS platform as of now is whisper, we want to re-use whisper's sentencization as the "chunks" to show in the FIXIT interface.
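To make the three m2a points above concrete, here is a minimal sketch in plain Python. It is not the converter from the eventual PR. It assumes the sentences have already been pulled out of the MMIF as (start, end, text) tuples, and apart from provenance, language, parts and speaker_id (which are named in this issue), every key name (id, start_time, end_time, text), the time unit, and the DEFAULT_REGION table are guesses for illustration only.

```python
import json

# Hypothetical fallback from bare language codes (as found in MMIF @language)
# to the regional tags AAPB JSON has been using. Not taken from any spec.
DEFAULT_REGION = {"en": "en-US"}


def to_aapb_language(mmif_language):
    """Return a regional tag for a bare code; leave regional tags untouched."""
    if "-" in mmif_language:
        return mmif_language
    return DEFAULT_REGION.get(mmif_language, mmif_language)


def sentences_to_aapb(guid, mmif_language, sentences):
    """Build an AAPB-style dict from (start_sec, end_sec, text) tuples.

    One part is emitted per sentence, instead of fixed-duration or
    fixed-token-count segments.
    """
    parts = []
    for index, (start, end, text) in enumerate(sentences, start=1):
        parts.append({
            "start_time": start,  # assumed key name and unit (seconds)
            "end_time": end,      # assumed key name and unit (seconds)
            "text": text,         # assumed key name
            # Convention from the notes above: start at 1, increment per segment.
            "speaker_id": index,
        })
    return {
        "id": guid,        # assumed key name
        "provenance": {},  # placeholder until the structure is settled
        "language": to_aapb_language(mmif_language),
        "parts": parts,
    }


if __name__ == "__main__":
    example = sentences_to_aapb(
        "cpb-aacip-example",  # made-up identifier
        "en",
        [(0.0, 2.5, "Hello."), (2.5, 4.0, "World.")],
    )
    print(json.dumps(example, indent=2))
```

Keeping the provenance value as an empty dictionary means consumers can rely on the key being present while its contents are still being decided.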

a2m conversion

The primary goal of the a2m conversion must be generating MMIF views that are close enough to CLAMS ASR MMIF that any downstream CLAMS pipeline can be applied to both types of "ASR" input. Also note that:

- not all CLAMS ASR outputs will be FIXIT-ed
- not all FIXIT outputs have an originating MMIF (non-CLAMS origin)
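As a rough illustration of the inbound direction, the sketch below flattens AAPB parts into one transcript string plus per-segment character and time spans, which is the kind of information an ASR-like MMIF view (text document, sentences, time frames, alignments) would need. It deliberately stops short of building actual MMIF, and it does not depend on an originating MMIF. The input key names (text, start_time, end_time), the seconds unit, and the output field names are all assumptions.

```python
def aapb_parts_to_segments(aapb):
    """Flatten AAPB parts into (full_text, segments).

    Each segment records where its text sits in full_text (character
    offsets) and when it is spoken (milliseconds).
    """
    pieces = []
    segments = []
    cursor = 0
    for part in aapb["parts"]:
        text = part["text"].strip()
        if not text:
            continue  # skip empty segments
        if pieces:
            cursor += 1  # account for the joining space
        segments.append({
            "char_start": cursor,
            "char_end": cursor + len(text),
            # Times may arrive as numbers or strings; both are accepted here.
            "start_ms": int(round(float(part["start_time"]) * 1000)),
            "end_ms": int(round(float(part["end_time"]) * 1000)),
        })
        pieces.append(text)
        cursor += len(text)
    return " ".join(pieces), segments


if __name__ == "__main__":
    full_text, spans = aapb_parts_to_segments({
        "parts": [
            {"text": "Hello.", "start_time": 0.0, "end_time": 2.5},
            {"text": "World.", "start_time": "2.5", "end_time": "4.0"},
        ]
    })
    for span in spans:
        print(span, repr(full_text[span["char_start"]:span["char_end"]]))
```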

keighrim · 2024-01-22T23:24:36Z

The next version of the whisper wrapper will have Sentence annotation objects. In the meantime, here's a teaser output file:

whisper-v4.out.json

keighrim · 2024-08-23T20:22:48Z

As the discussion on the consumer specification (clamsproject/clams-python#93) hasn't made significant progress, I decided to move this feature back to the "aapb" package as originally planned, via #2 and clamsproject/clams-utils#4. The PR implements the conversion as described in this issue, but does not include conversion from AAPB-json to MMIF format. Any future development regarding this conversion should be discussed in the clams-utils repo from now on.

keighrim added the ✨N New feature or request label Jan 12, 2024

keighrim mentioned this issue Jan 29, 2024

Make summarizer available on PyPI clamsproject/mmif-summarizer#12

Open

keighrim transferred this issue from clamsproject/clams-utils Feb 26, 2024

keighrim mentioned this issue Jul 11, 2024

adding aapbjson converter clamsproject/clams-utils#4

Merged

keighrim closed this as completed Aug 23, 2024
