This repository has been archived by the owner on Aug 23, 2024. It is now read-only.
New Feature Summary
As a new MMIF consumer (originally planned as a part of the aapb package), we'd like to have a data format converter between MMIF and the AAPB JSON files used internally by the FIXIT tool. This should enable data exchange between CLAMS pipelines and FIXIT (a crowd-based ASR correction tool), and let us import FIXIT-ed transcripts into CLAMS pipelines for further processing.
Additional context
Here are some example AAPB JSON files provided by @owencking:
cpb-aacip-3fde6d4dc0b-transcript.json
cpb-aacip-05a9a67fd3d-transcript.json
and some notes from him:
I think the format is pretty self-explanatory. Just two notes: (1) Notice the speaker_id key. We need to leave that in there, but right now it's not in use. Our convention is to start with 1 and increment the value with each transcript segment. (2) In the future, we will add a provenance key at the top of the JSON file with an object containing a few keys related to the origin of the transcript. We're still figuring out exactly what info to put in there. It will probably be some of the same info that appears at the beginning of an MMIF file.
More examples are in the Google Drive shared folder.
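As a rough illustration of the speaker_id convention in note (1), the sketch below numbers transcript segments starting at 1 and incrementing by one per segment. The helper name number_segments and the text, start_time, and end_time keys are assumptions made for this example; only speaker_id is described in the notes above.

```python
def number_segments(segments):
    """Return copies of the segments with a 1-based, incrementing speaker_id.

    Keys other than speaker_id are hypothetical and only here for illustration.
    """
    return [dict(seg, speaker_id=i) for i, seg in enumerate(segments, start=1)]


parts = number_segments([
    {"text": "Good evening.", "start_time": 0.0, "end_time": 1.2},
    {"text": "Tonight's top story.", "start_time": 1.2, "end_time": 3.0},
])
```

Because the key is not in use right now, the value carries no speaker information; it only needs to be present and follow the numbering convention.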
Implementation requirements
The conversion needs to support both directions: MMIF to AAPB (m2a, or outbound) and AAPB to MMIF (a2m, or inbound).
m2a conversion
At the top level of the AAPB JSON, a provenance key will be added to hold information about the source (MMIF) and the FIXIT process. How that information will be organized is not settled yet, except that it will be a nested dictionary with a fairly flat structure. For now, the converter should put a placeholder in for that "meta" key.
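One possible reading of "placeholder" is an empty dictionary under the provenance key, to be filled in once the layout is decided. The sketch below assumes that reading; the function name with_provenance_placeholder is hypothetical, and the empty-dict choice is an assumption rather than an agreed format.

```python
def with_provenance_placeholder(transcript):
    """Return a copy of an AAPB-style dict that always has a provenance key.

    An existing provenance value is kept; otherwise an empty dict is used
    as the placeholder (an assumption, the real layout is not settled).
    """
    result = dict(transcript)
    result.setdefault("provenance", {})
    return result


original = {"language": "en-US", "parts": []}
converted = with_provenance_placeholder(original)
```

Copying the input first keeps the caller's dictionary untouched, so the same transcript can be converted again once the provenance layout is known.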
In the AAPB JSON, the top-level language key has mostly held en-US values. The MMIF spec has a @language key in text documents, but we haven't been using regional codes there. See how the language description is specified in the current mmif-python SDK.
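To make the mismatch concrete, here is a minimal sketch of one way the outbound converter could fill the AAPB language key from a region-less MMIF value. The function name to_aapb_language and the default table are hypothetical; which region to assume for a bare language code is a policy decision this issue does not settle.

```python
# Hypothetical defaults; the real policy is not decided in this issue.
DEFAULT_REGION = {"en": "en-US"}


def to_aapb_language(mmif_language):
    """Map an MMIF @language value to an AAPB language value.

    Values that already carry a region are passed through; bare codes use
    the default table when an entry exists and are otherwise left unchanged.
    """
    if "-" in mmif_language:
        return mmif_language
    return DEFAULT_REGION.get(mmif_language, mmif_language)
```

With this table, "en" becomes "en-US", while "en-GB" and codes without a default such as "es" come back unchanged.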
For elements in the parts field of the AAPB JSON, GBH seems to have used a fixed duration (5 sec) to generate the "segments" of transcript text that populate the parts list. When the Brandeis team worked on MMIF > VTT conversion in MMIF-viz, we used a fixed number of tokens (8 tokens) to generate VTT segments. However, since the primary ASR app on the CLAMS platform is currently whisper, we want to reuse whisper's sentence segmentation as the "chunks" shown in the FIXIT interface.
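The two fixed-size strategies mentioned above can be sketched as follows, mainly to show how they differ from sentence-based chunks. Tokens are modeled as (text, start, end) tuples in seconds, tokens are assigned to a window by their start time, and the output keys text, start_time, and end_time are all assumptions for this example, not the actual GBH or MMIF-viz code.

```python
def chunk_by_count(tokens, size=8):
    """Group tokens into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def chunk_by_duration(tokens, window=5.0):
    """Group tokens by the fixed-length time window their start time falls in."""
    chunks = {}
    for token in tokens:
        start = token[1]
        chunks.setdefault(int(start // window), []).append(token)
    return [chunks[key] for key in sorted(chunks)]


def to_part(chunk):
    """Collapse a chunk of (text, start, end) tokens into one segment dict."""
    return {
        "text": " ".join(text for text, _, _ in chunk),
        "start_time": chunk[0][1],
        "end_time": chunk[-1][2],
    }


# Twenty one-second toy tokens: w0 spans 0-1 s, w1 spans 1-2 s, and so on.
tokens = [(f"w{i}", float(i), float(i + 1)) for i in range(20)]
by_count = [to_part(chunk) for chunk in chunk_by_count(tokens)]
by_time = [to_part(chunk) for chunk in chunk_by_duration(tokens)]
```

Both strategies cut wherever the count or the clock says so, regardless of sentence boundaries. Reusing the sentences whisper already produces would replace the two chunkers with a grouping over those sentence spans, while to_part could stay the same.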
a2m conversion
The primary goal of the a2m conversion is to generate MMIF views that are similar enough to CLAMS ASR MMIF that any downstream CLAMS pipeline can be applied to both types of "ASR" input. Also note that:
not all CLAMS ASR outputs will be FIXIT-ed
not all FIXIT output has originating MMIF (non-CLAMS origin)
As the discussion on the consumer specification (clamsproject/clams-python#93) hasn't made significant progress, I decided to move this feature back to the "aapb" package as originally planned, via #2 and clamsproject/clams-utils#4. The PR implements the conversion as described in this issue, except that it does not include conversion from AAPB JSON to MMIF. Any future development regarding this conversion should be discussed in the clams-utils repo from now on.
updated based on recent discussions.