Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prevent using of "short" annotation ID and release MMIF 2.0 #228

Open
keighrim opened this issue Jun 18, 2024 · 6 comments
Open

prevent using of "short" annotation ID and release MMIF 2.0 #228

keighrim opened this issue Jun 18, 2024 · 6 comments

Comments

@keighrim
Copy link
Member

keighrim commented Jun 18, 2024

Because

The generosity of allowing short form for annotation id values provides almost no benefit (besides saving some bytes) while adding lots of technical debt and bugs. For example, we have this code in visualizer

https://github.com/clamsproject/mmif-visualizer/blob/9019c154fccde11a605b77d412ddaca3e7be566c/ocr.py#L65-L68

but source and target values in Alignment annotations almost always use the "long" form (prefixed with view ID) of the annotation ID since lots of alignments are done across views, and hence these checks are destined to fail and introduce silent errors.

I suggest we allow only using long form of the ID in everywhere in MMIF, so that ann.id and ann.long_id always return the same value. This is major change in MMIF spec, hence leads to MMIF 2.0.

Done when

Either

  1. we update all existing code that uses == checks with ann.id value to a fixed standard and educate all existing and future developers to use the same implementation,
  2. we simply disallow "short" form of annotation ID in MMIF world and change ann.id to return ann.long_id.

Additional context

No response

@marcverhagen
Copy link
Contributor

I can see some value in always having view_id:annotation_id, but I do also see value in not stating the obvious inside a view.

However, I really like the idea of having ann.id be a property that returns the full value. As long as that is reasonably efficient then no other changes would be needed I think. If that is true then option 2 seems superior over option 1.

@marcverhagen
Copy link
Contributor

Oh, I just realized that what I thought was option 2 is really another option.

@keighrim
Copy link
Member Author

If ann.id returns the long form, I believe "id" field of the annotation object in the JSON serialization must have the long form, so that SDK-based ann.id call and general JSON operation ann["id"] behave equivalently.

@keighrim
Copy link
Member Author

keighrim commented Jun 25, 2024

While working on clamsproject/mmif-python#285, I realized that allowing the short form for annotation id values within the view "scope" actually introduces un-recoverable ambiguity, and hence we MUST stop doing this. Here's an example of output MMIF from an imaginary spell correction app.

{
  ...
  "documents": [
    {
      "@type": "https://mmif.clams.ai/vocabulary/TextDocument/v1/",
      "properties": {
        "id": "td1",
        "location": "file:///var/archive/spell-errors.txt" }
    }
  ],
  "views": [

    {
      "id": "v1",
      "metadata": {
        "app": "http://apps.clams.ai/awesome-spell-corrector/v7", 
        ...
      },
      "annotations": [
        {
          "@type": "https://mmif.clams.ai/vocabulary/TextDocument/v1/",
          "properties": {
            "id": "td1",
            "text": { "@value": "very correctly spelled text" }
          }
        },
        {
          "@type": "https://mmif.clams.ai/vocabulary/Alignment/v1/",
          "properties": {
            "id": "a1",
            "source": "td1", 
            "target": "td1"   # ???
          }
        },
        ...
      ]
    }
  ]
}

@marcverhagen
Copy link
Contributor

Yes, that does appear to be an oversight since we nowhere stated that annotation identifiers should not be the same as identifiers in the documents list.

@keighrim
Copy link
Member Author

Last time I and @marcverhagen talked about this in a in-person meeting, we temporarily decided to update the SDK code to use "long_id" everywhere, but keep the MMIF version 1.x. That seemed to work with MMIF spec version. But now that I think of mmif-python version, and because the mmif-python SDK version is "bound" to the spec version down to minor level (first two ints) , I don't think keeping the MMIF version to 1.x will work at all.

If we update the SDK to use long_id everywhere, the new SDK won't be able to read existing mmif files. Hence, to mark the boundary between that compatibility I firmly believe that we MUST release mmif-python 2.x. But that can't be done without releasing MMIF 2.x, because of the version coupling. This problem already raised three years ago (#14 (comment)) but never properly addressed back then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Todo
Development

No branches or pull requests

2 participants