This repository has been archived by the owner on Aug 23, 2024. It is now read-only.

MMIF-FIXIT conversion #1

Closed
keighrim opened this issue Jan 12, 2024 · 2 comments
Labels
✨N New feature or request

Comments

@keighrim
Member

keighrim commented Jan 12, 2024

Updated based on recent discussions.

New Feature Summary

As a new MMIF consumer (originally planned as a part of the aapb package), we'd like to have a data format converter between MMIF and the AAPB json files used internally in the FIXIT tool. This should enable data exchange between CLAMS pipelines and FIXIT (a crowd-based ASR correction tool), and let us import FIXIT-ed transcripts into CLAMS pipelines for further processing.

Additional context

Here are some example AAPB json files provided by @owencking.

cpb-aacip-3fde6d4dc0b-transcript.json
cpb-aacip-05a9a67fd3d-transcript.json

and some notes from him:

I think the format is pretty self-explanatory. Just two notes: (1) Notice the speaker_id key. We need to leave that in there, but right now it's not in use. Our convention is to start with 1 and increment the value with each transcript segment. (2) In the future, we will add a provenance key at the top of the JSON file with an object containing a few keys related to the origin of the transcript. We're still figuring out exactly what info to put in there. It will probably be some of the same info that appears at the beginning of a MMIF file.

More examples are available in the shared Google Drive folder.

Implementation requirements

The conversion needs to support both directions: MMIF to AAPB (m2a, or outbound) and AAPB to MMIF (a2m, or inbound).

m2a conversion

  1. On the top level of the AAPB json, a provenance key will be added to hold information about the source (MMIF) and the FIXIT process. At the moment, how that information will be organized is not fully settled, except that it will be a nested dictionary with a fairly flat structure. Thus, for now, the converter should put a placeholder for that "meta" key.
  2. In the AAPB json, the top-level language key has mostly held en-US values. The MMIF spec has an @language key in text documents, but we haven't been using regional codes. See the following for how language description is specified in the current
  3. For elements in the parts field of the AAPB json, GBH seems to have been using a fixed duration (5 sec) to generate the "segments" of transcript text that populate the parts list. When the Brandeis team worked on MMIF > VTT conversion in MMIF-viz, we used a fixed number of tokens (8 tokens) to generate VTT segments. However, as the primary ASR app in the CLAMS platform as of now is whisper, we want to re-use whisper's sentencization as the "chunks" to show on the FIXIT interface.
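
The three points above can be sketched roughly as follows. This is a minimal sketch and not the converter itself: it assumes the whisper sentences have already been pulled out of the MMIF as (start, end, text) tuples, and the function name `sentences_to_aapb` and the key names `id`, `start_time`, `end_time` and `text` are assumptions made for illustration. Only `language`, `parts`, `speaker_id` and `provenance` are named in this issue.

```python
import json


def sentences_to_aapb(sentences, media_id, language="en-US"):
    """Build an AAPB-style transcript dict from sentence-level chunks.

    `sentences` is a list of (start_sec, end_sec, text) tuples, e.g. taken
    from whisper's sentence segmentation. Key names other than `language`,
    `parts`, `speaker_id` and `provenance` are assumptions.
    """
    parts = []
    # speaker_id is unused for now; by convention it starts at 1 and
    # increments with each transcript segment.
    for speaker_id, (start, end, text) in enumerate(sentences, start=1):
        parts.append({
            "start_time": start,       # assumed key name
            "end_time": end,           # assumed key name
            "text": text.strip(),      # assumed key name
            "speaker_id": speaker_id,
        })
    return {
        "id": media_id,                # assumed key name
        "language": language,
        "provenance": {},              # placeholder until the layout is settled
        "parts": parts,
    }


# Hypothetical input: two sentences with made-up timestamps.
example = sentences_to_aapb(
    [(0.0, 2.5, " Hello there. "), (2.5, 4.0, "Good evening.")],
    media_id="cpb-aacip-example",
)
print(json.dumps(example, indent=2))
```

Using one sentence per part means segment length follows the speech rather than a fixed 5-second or 8-token window, which is the behavior item 3 asks for.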

a2m conversion

The primary goal of the a2m conversion is to generate MMIF views that are similar enough to CLAMS ASR MMIF that any downstream CLAMS pipeline can be applied to both types of "ASR" input. Also note that:

  1. not all CLAMS ASR outputs will be FIXIT-ed
  2. not all FIXIT output has originating MMIF (non-CLAMS origin)
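
As a rough illustration of the inbound direction, the sketch below flattens AAPB-style parts into one text with character and time spans, which is the information an ASR-like view would need (a text document, sentence spans, and their time alignment). It deliberately stops short of emitting real MMIF: the output layout, the function name `aapb_to_segments`, and the part keys `start_time`, `end_time` and `text` are all assumptions, and no source MMIF is required, so it also covers transcripts of non-CLAMS origin.

```python
def aapb_to_segments(aapb):
    """Flatten AAPB-style parts into one text plus aligned spans.

    Returns a plain dict (not MMIF) holding the joined transcript text and,
    for each part, its character offsets in that text and its time range.
    """
    pieces = []
    spans = []
    offset = 0
    for part in aapb.get("parts", []):
        chunk = part["text"].strip()
        if not chunk:
            continue  # skip empty segments
        if pieces:
            offset += 1  # account for the single space used to join chunks
        spans.append({
            "char_start": offset,
            "char_end": offset + len(chunk),
            # float() accepts both numeric and string timestamps
            "time_start": float(part["start_time"]),
            "time_end": float(part["end_time"]),
        })
        pieces.append(chunk)
        offset += len(chunk)
    return {
        "language": aapb.get("language"),
        "text": " ".join(pieces),
        "sentences": spans,
    }


# Hypothetical input with made-up timestamps.
result = aapb_to_segments({
    "language": "en-US",
    "parts": [
        {"start_time": "0.0", "end_time": "2.5", "text": "Hello there.", "speaker_id": 1},
        {"start_time": "2.5", "end_time": "4.0", "text": "Good evening.", "speaker_id": 2},
    ],
})
for span in result["sentences"]:
    print(span["time_start"], span["time_end"],
          result["text"][span["char_start"]:span["char_end"]])
```

Keeping character offsets alongside time ranges is one way to let a later step build the text, sentence and alignment annotations that downstream apps expect from ASR output.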

keighrim added the ✨N New feature or request label Jan 12, 2024
@keighrim
Member Author

The next version of the whisper wrapper will have Sentence annotation objects. In the meantime, here's a teaser output file:

whisper-v4.out.json

@keighrim
Member Author

As the discussion on the consumer specification (clamsproject/clams-python#93) hasn't made significant progress, I decided to move this feature back to the "aapb" package as originally planned, via #2 and clamsproject/clams-utils#4. The PR implements the conversion as described in this issue, but does not include conversion from AAPB-json to MMIF format. Any future development regarding this conversion should be discussed in the clams-utils repo from now on.
