
Centralizing and Synchronizing Annotation Projects Across Distinct Document Sets #39

Closed
adelavega opened this issue Nov 1, 2024 · 14 comments



adelavega commented Nov 1, 2024

Problem:

  • Each annotation project operates on its unique set of documents.
  • The goal is to establish a highly efficient, compact set of deeply annotated documents.

Key Considerations:

  • Some annotations generated by Pubget are still in an older format; we need tooling to update these old annotations.
  • Certain projects will remain distinct by nature (e.g., fMRI versus diffusion studies), yet it’s essential to enable synchronization across annotations for consistency and utility.

Proposed Solution:

  1. Centralize Source Documents:

    • Place all source documents at the top level of the repository to facilitate organization and access.
  2. Distinguish Document Sources and Annotation Subsets:

    • Clearly differentiate between document sources (e.g., specific Pubget searches) and the annotated subset of documents. For instance, a search might retrieve 1,000 articles, but only the first 100 might undergo detailed annotation.
  3. Streamlined Access for Annotation Projects:

    • Allow each project to be reduced to a view of its annotated subset, simplifying access within Labelbuddy and making subsets easily available for project-specific workflows.

The main issue with this approach is that it would require quite a bit of work to update the existing tooling that Jerome wrote, including the website, etc.

Alternative Solution: Streamlined Organization without Full Centralization

  1. Maintain Project-Specific Document Sets:

    • Keep each project’s document set within its own folder to avoid merging everything at the top level. This preserves the existing structure and minimizes changes.
  2. Implement Annotation updating tools:

    • Use pubget to reprocess all annotations in the whole project to the latest version, making them mergeable.
  3. Implement tools for better merging of documents in existing projects:

    • Add tools to see which documents have already been annotated in other projects, and to search within those documents using certain criteria (e.g., new search terms), maximizing the chance of annotating documents that already carry annotations.
    • If new documents are needed for a specific annotation project, a new pubget search can be done and only the top n documents included.

Other related issues:
neuroquery/pubget#46

adelavega commented:

I'm starting to lean towards the latter option.

I think the key would be to add tooling to search documents in other projects (e.g., full-text search) to see if they meet your criteria, and copy those that do into your new project folder.

This would make it easier to say something like: I want "resting-state" documents that were also annotated by "participant_demographics". Say 52 of those match; then copy those into my new annotation project folder.
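As a rough illustration, here is a sketch of that kind of selection (a sketch only: the projects/ paths, file layout, and the "metadata" -> "pmcid" field are assumptions, not the repo's actual structure):

```python
# Hypothetical sketch: select documents mentioning "resting-state" that
# already have annotations in participant_demographics, then copy them
# into a new project folder. All paths and field names are assumptions.
import json
from pathlib import Path

def read_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# pmcids already annotated by participant_demographics (assumed layout)
annotated = {
    record["metadata"]["pmcid"]
    for ann_file in Path("projects/participant_demographics/annotations").glob("*.jsonl")
    for record in read_jsonl(ann_file)
}

# documents matching the full-text criterion AND already annotated
selected = [
    doc
    for doc_file in Path("projects/participant_demographics/documents").glob("*.jsonl")
    for doc in read_jsonl(doc_file)
    if "resting-state" in doc["text"].lower() and doc["metadata"]["pmcid"] in annotated
]

# copy the matches into the new annotation project's folder
out = Path("projects/my_new_project/documents/documents.jsonl")
out.parent.mkdir(parents=True, exist_ok=True)
with open(out, "w", encoding="utf-8") as f:
    for doc in selected:
        f.write(json.dumps(doc) + "\n")
print(f"{len(selected)} matching documents copied")
```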

In addition, we might want to allow for sub-projects, so we can group together all the "cobidas"-related sub-annotation projects (in recognition that there are non-overlapping sets of docs for cobidas, and it's easier to annotate one thing vs. many).

adelavega commented:

I'm realizing that Jerome's excellent documentation actually covers this under "Reusing documents from another project".

However, the proposed solution is just to symlink to another project. I wonder if a better solution would be to allow users to remix documents from other projects?

jeromedockes commented:

thanks @adelavega, that's very interesting. indeed it would be great to use the same documents in different annotation projects as much as possible, as it might help us study associations between the different things being annotated.
For that, as you mention, we need a way to explore the already-existing documents and define new subsets that we will annotate. Maybe there are some lessons learned from the design of the neurosynth-compose StudySets & co that we can apply here?

this might also be an opportunity to revisit how we store documents. ATM the JSON files to annotate are directly in the github repository, which is simple and guarantees we always have all the correct documents, but it means we have to be very careful not to add too many; there may be a better solution.

> However, the proposed solution is just to symlink to another project. I wonder if a better solution would be to allow users to remix documents from other projects?

yes, the symlink approach is probably too limiting because it assumes you will basically work on the same document set as another project, rather than regroup documents dispersed across many files. I think having a couple of use cases (example projects that need a more complex selection of documents) would help a lot in finding a better solution. I'm guessing we could do something similar to the symlinks but slightly more expressive: in the project directory we could have a file containing a list of pmcids and where to find them in the repo, and reconstruct the labelbuddy file when we run start_project.py or something like that?
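a minimal sketch of that idea, assuming a tab-separated manifest (pmcid, path) and a "metadata" -> "pmcid" field in the document records (both are placeholders, not the repo's actual format):

```python
# Sketch of the "more expressive symlink": a per-project manifest maps
# each pmcid to the repo file that contains it, and start_project.py
# rebuilds the labelbuddy file from that. The manifest format
# (pmcid<TAB>path) and the "metadata" -> "pmcid" field are assumptions.
import json
from pathlib import Path

def rebuild_labelbuddy_file(manifest_path, out_path):
    with open(out_path, "w", encoding="utf-8") as out:
        for line in Path(manifest_path).read_text().splitlines():
            pmcid, source = line.split("\t")
            for raw in open(source, encoding="utf-8"):
                doc = json.loads(raw)
                if str(doc["metadata"]["pmcid"]) == pmcid:
                    out.write(json.dumps(doc) + "\n")
                    break

rebuild_labelbuddy_file("projects/my_project/documents.tsv",
                        "projects/my_project/documents.jsonl")
```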

for finding and selecting the documents we could use the database created by make database, possibly adding more information to it. the question is how to make something usable without spending a lot of effort on building an interface for inspecting and choosing documents.

adelavega commented:

that sounds similar to what I was thinking: the project-specific labelbuddy files could be "temporary", but they can be recreated on the fly from a central "repository" given the list of PMCIDs that are annotated in a given project.

regarding where to store this "repository": perhaps we can use datalad/git-annex or something if size becomes an issue, but we can probably punt on that for now.

also, it's probably best for the repository to include the labelbuddy JSONs instead of recreating them, in case the text changes, which would make the annotations hard to interpret. I would not want to rely on re-downloading the documents as they could change. specific documents can still be remixed with others into new JSONL files, however.

adelavega commented:

in terms of use cases, here is one:

we want to annotate various aspects of COBIDAS, but it is rather large. so for sub-projects that only cover one modality (e.g. fMRI/diffusion), you only want to annotate relevant documents. it's annoying to have to skip irrelevant ones, so ideally you filter those out using keywords.

however, there are also aspects of COBIDAS that span modalities, so you'd want to annotate all of those documents on another dimension (say, data sharing statements).

it's possible that this use case could be handled by symlinks if you thoughtfully create the subsets prior to starting annotation.

adelavega commented:

finally, in terms of finding relevant papers, I agree this is challenging. perhaps this can be a less controlled process that we facilitate through the python interface, but it may still require some amount of user intervention to create a final list of PMCIDs to include in a project.

then it is added to the project and a script populates a "view" of labelbuddy files for annotation.


adelavega commented Nov 8, 2024

I took a look at the existing overlap between projects (excluding symlinks, so there is actually more overlap in reality):

[image: overlap of documents across projects]

There are 4477 documents in labelbuddy files, but only 3697 of them are unique.
In addition, only 1000 of these have any form of annotation (~1600 when you include the symlinked docs).

This tells me that we are storing far more document data than we need to, and because there are so many documents available, our annotations are likely not to overlap unless we are deliberate about it.

adelavega commented:

I'm still not 100% sure what we should do here, but my feeling is we're getting too scattered. Either we need more discipline about how we get documents for annotation, or we centralize all documents and check out temporary views for annotation only.

adelavega commented:

We could also take a more hands-off approach: add a "shared_documents" folder and do a one-time re-organization now, grouping documents into useful subsets.

jeromedockes commented:

I agree with everything you said above :) thanks for investigating the overlap between projects. I'm not surprised that annotations are very sparse, and we do need to make an effort to reduce the documents to a subset we realistically hope to annotate completely before starting annotations in a new project. the queries on pubmed tend to give a large number of results, so if every project re-downloads documents there will never be any overlap.

I guess we need:

  • a dump of all the pubget results from which at least one document was used in this repo. it should be append-only, probably does not need version control, and may contain duplicated pmcids (possibly even different versions of the same article) when an article shows up in several queries, although we should try to avoid that. we were trying to use OSF for this
  • a pool of labelbuddy json files, extracted from the pubget downloads, which are or will be annotated. as soon as a document has an annotation, the corresponding json file should be checked into the git repo so we are sure we can find the exact text that was annotated. it does not really matter where in the repo: annotations are linked to documents through a hash of the document's text, so we can identify the document reliably regardless of duplicated pmcids, changes in pmc, etc. (see the sketch after this list)
  • a way to select a subset that we will annotate in a new project. we already have a database of the documents in the repo and their annotations, which can be useful for that, and it can be combined with other logic in python scripts, or even with going through a bunch of articles in labelbuddy and tagging those we want to keep for a project
  • for each project, a list of the document ids it uses, so that a labelbuddy database can be created locally when someone decides to start contributing to a project. the "start on this project" script reads the list of documents, creates a temporary json file, and imports it into the database. neither the database nor the json file gets checked into version control, only the exported annotations
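to make the hash-based linking concrete, here is a toy sketch (the "utf8_text_md5_checksum" and "text" field names mirror labelbuddy's export format, but treat them as assumptions here):

```python
# Toy illustration of identifying a document by a checksum of its text,
# so annotations stay linked even with duplicated pmcids or upstream
# changes in PMC. Field names are assumptions modeled on labelbuddy's
# export format.
import hashlib

def text_checksum(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def find_annotated_document(annotation, documents):
    # recover the exact text that was annotated, regardless of pmcid clashes
    return next(
        doc for doc in documents
        if text_checksum(doc["text"]) == annotation["utf8_text_md5_checksum"]
    )
```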


adelavega commented Nov 14, 2024

great!

I started working on centralizing the annotations by doing something similar to what you outlined.

My approach was to:

  1. look at all existing documents and re-download them using pubget into a temporary directory (since this can always be recreated, we don't need to keep it)
  2. check all the labelbuddy documents from this dump into a central documents/ directory (maybe we should only keep those with existing annotations, but for now I kept all those referenced in projects)
  3. give each project a pmcids.txt file that lists which of the centralized documents are relevant to it

the above is a one-time migration operation

still to do:

  • write a script that can "check out" documents for a project (from the pmcids.txt file) from the centralized location into a {project_name}/_documents directory inside each project, to open in labelbuddy for annotation (see the sketch below).
    _documents folders are in .gitignore.
    (it seems okay to me for annotations to be kept within a project)
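a first sketch of that check-out script (directory layout and the "metadata" -> "pmcid" field are assumptions until the PR settles them):

```python
# Sketch of the "check out" step: read the project's pmcids.txt and pull
# the matching records from the central documents/ pool into a gitignored
# {project_name}/_documents folder. Layout and field names are assumptions.
import json
from pathlib import Path

def checkout_documents(project_dir, central_dir="documents"):
    project = Path(project_dir)
    wanted = set((project / "pmcids.txt").read_text().split())
    out_dir = project / "_documents"  # listed in .gitignore
    out_dir.mkdir(exist_ok=True)
    with open(out_dir / "documents.jsonl", "w", encoding="utf-8") as out:
        for doc_file in Path(central_dir).glob("*.jsonl"):
            for raw in open(doc_file, encoding="utf-8"):
                doc = json.loads(raw)
                if str(doc["metadata"]["pmcid"]) in wanted:
                    out.write(json.dumps(doc) + "\n")

checkout_documents("projects/my_project")
```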

then in the future, if somebody wants to get new documents for a project, there could be a script that:

  • given a pubmedcentral search (which is specific enough) and a target number of documents (e.g. 100), finds the overlap between that search and already-annotated documents, prioritizing existing documents (sketched below).
    if, for example, 100 are requested and 70 overlap with existing documents, only 30 new articles need to be added
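the prioritization itself is simple; something like this (names are hypothetical):

```python
# Sketch of the selection rule: given the pmcids returned by a PMC search
# and a target count, reuse already-annotated documents first and only
# top up with new ones.
def select_documents(search_pmcids, already_annotated, n_target=100):
    reused = [p for p in search_pmcids if p in already_annotated]
    new = [p for p in search_pmcids if p not in already_annotated]
    return (reused + new)[:n_target]

# e.g. 100 requested, 70 overlapping -> only the 30 top-ups are new downloads
picked = select_documents(["PMC1", "PMC2", "PMC3"], {"PMC2"}, n_target=2)
# -> ["PMC2", "PMC1"]
```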

we could also add a "clean up" script which removes all documents that have no annotations in any project (a possible shape is sketched at the end of this comment).


  • the only potential issue I see with this approach is that it's less intuitive, especially if the target documents are hard to find with a specific PMC query. But we could also allow users to come up with a custom pmcids.txt file by whatever method they see fit (including manually browsing documents, or doing a broader search than necessary and then running the clean-up script).
  • also, perhaps you are right that only documents with annotations should be checked in, to save space.
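for completeness, a possible shape for that clean-up script (directory layout and field names are assumptions):

```python
# Possible shape for the clean-up script: drop every centrally stored
# document whose pmcid is not referenced by annotations in any project.
# Directory layout and field names are assumptions.
import json
from pathlib import Path

annotated = {
    str(json.loads(line)["metadata"]["pmcid"])
    for ann_file in Path("projects").glob("*/annotations/*.jsonl")
    for line in open(ann_file, encoding="utf-8")
}

for doc_file in Path("documents").glob("*.jsonl"):
    docs = [json.loads(line) for line in open(doc_file, encoding="utf-8")]
    kept = [d for d in docs if str(d["metadata"]["pmcid"]) in annotated]
    if len(kept) < len(docs):  # rewrite only files that actually shrank
        with open(doc_file, "w", encoding="utf-8") as f:
            f.writelines(json.dumps(d) + "\n" for d in kept)
```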

jeromedockes commented:

> the above is a one-time migration operation

sorry, I hadn't realized that, so you can ignore most of my comments on the PR: we don't care too much about the details of the script if we'll only run it once. but of course it should be obvious that moving all documents into one place only needs to be done once 😅

> since this can always be recreated, we don't need to keep it

I wonder if we want to keep the pubget output somewhere? if the article in pubmed later changes or disappears, we might want the original full xml that produced the thing we annotated. maybe not worth the storage & effort, though.

> _documents folders are in .gitignore.

those you could also discard after populating the labelbuddy database

re. getting new documents, I agree the script you describe would be useful, but for now we can say "come up with your pmcid list" is the current solution.

one advantage of not keeping documents without annotations in inactive projects is that if I am lazy, I will prefer to use documents that are already available in the repo; and if the repo only contains documents that are annotated (or very likely to be annotated), I will end up annotating the same documents as other projects, which is what we want.

adelavega commented:

My bad, I should have marked the PR as WIP, which it is.

Not sure we even need to keep the one-time migration script around, then.

As far as original pubget output, we could keep it in a folder that is not checked in by default.

Yes, agreed: for now we can punt on the search scripts and just tell projects to come up with a pmcid list in a way that best fits their needs. Also agreed that it's good to nudge projects to re-use docs. Will add that to the PR.

adelavega commented:

Closed by #40
