Centralizing and Synchronizing Annotation Projects Across Distinct Document Sets #39
I'm starting to lean towards the latter option. I think the key would be to add tooling to search documents in other projects (e.g. full-text search) to see if they meet your criteria, and copy those that do to your new project folder. This would make it easier to say something like: I want "resting-state" documents that were also annotated by "participant_demographics". Let's say 52 of those match; then copy those into my new annotation project folder. In addition, we might want to allow for sub-projects, so we can group together all the "cobidas"-related sub-annotation projects (in recognition that there are non-overlapping sets of docs for cobidas, and it's easier to annotate one thing vs many).
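A rough sketch of what that "remix" tooling could look like, just to make the idea concrete. The directory layout (`projects/<name>/documents/*.jsonl`) and the `text`/`metadata` document fields are assumptions for illustration, not the repository's actual schema:

```python
import json
from pathlib import Path


def iter_documents(project_dir: Path):
    """Yield every document stored in a project's labelbuddy JSONL files."""
    for jsonl in sorted(project_dir.glob("documents/*.jsonl")):
        with open(jsonl, encoding="utf-8") as stream:
            for line in stream:
                if line.strip():
                    yield json.loads(line)


def copy_matching(source_project: Path, target_project: Path, keyword: str) -> int:
    """Copy documents whose full text contains `keyword` into the target project."""
    matches = [
        doc for doc in iter_documents(source_project)
        if keyword.lower() in doc.get("text", "").lower()
    ]
    out_file = target_project / "documents" / f"remixed_{keyword}.jsonl"
    out_file.parent.mkdir(parents=True, exist_ok=True)
    with open(out_file, "w", encoding="utf-8") as stream:
        for doc in matches:
            stream.write(json.dumps(doc) + "\n")
    return len(matches)


# e.g. pull "resting-state" papers out of the participant_demographics project
# copy_matching(Path("projects/participant_demographics"),
#               Path("projects/resting_state_subproject"), "resting-state")
```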
I'm realizing that Jerome's excellent documentation actually covers this. However, the proposed solution is just to do a symlink to another project. I wonder if a better solution is to allow users to remix documents from other projects?
thanks @adelavega that's very interesting. indeed it would be great to use the same documents in different annotation projects as much as possible, as it might help study associations between the different things being annotated. this might also be an opportunity to revisit how we store documents. ATM the JSON files to annotate are directly in the github repository, which is simple and guarantees we always have all the correct documents, but it means we have to be very careful not to add too many, so there may be a better solution.
yes the symlink approach is probably too limiting because it assumes you will basically work on the same document set as another project, rather than regroup documents dispersed across many files. I think having a couple of use cases, example projects that need a more complex selection of documents, would help a lot in finding a better solution. I'm guessing we could do something similar to the symlinks but slightly more expressive: in the project directory we could have a file containing a list of pmcids and where to find them in the repo, and reconstruct the labelbuddy file when we run the build step. for finding and selecting the documents we could use the database built by the repository's existing scripts
that sounds similar to what I was thinking: the project-specific label-buddy files could be "temporary" but can be recreated from a central "repository" on the fly, given a list of PMCIDs that are annotated in a given project. regarding where to store this "repository", perhaps we can use datalad/git-annex or something if size becomes an issue, but we can probably punt on that issue for now. also, it's probably best for the repository to include the label-buddy JSONs instead of recreating them, in case the text changes, which would make the annotations hard to interpret. i would not want to rely on re-downloading them as they could change. specific documents can be remixed with others into new jsonl files, however.
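As a rough illustration of this "temporary view" idea, a minimal sketch that rebuilds a project's labelbuddy file from a central pool given its PMCID list; the file names (`pmcids.txt`, `documents.jsonl`) and the `metadata["pmcid"]` field are assumptions, not the actual layout:

```python
import json
from pathlib import Path


def load_pool(pool_dir: Path) -> dict:
    """Index every document in the shared pool by its PMCID."""
    index = {}
    for jsonl in sorted(pool_dir.glob("*.jsonl")):
        with open(jsonl, encoding="utf-8") as stream:
            for line in stream:
                if line.strip():
                    doc = json.loads(line)
                    index[doc["metadata"]["pmcid"]] = doc
    return index


def build_project_view(pool_dir: Path, project_dir: Path) -> list:
    """Recreate the project's documents.jsonl from its pmcids.txt; return missing PMCIDs."""
    pmcids = (project_dir / "pmcids.txt").read_text(encoding="utf-8").split()
    pool = load_pool(pool_dir)
    with open(project_dir / "documents.jsonl", "w", encoding="utf-8") as stream:
        for pmcid in pmcids:
            if pmcid in pool:
                stream.write(json.dumps(pool[pmcid]) + "\n")
    # anything not in the pool still needs to be fetched (e.g. with pubget)
    return [pmcid for pmcid in pmcids if pmcid not in pool]
```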
in terms of use cases, here is one: we want to annotate various aspects of COBIDAS, but it is rather large. So for some sub-projects that only cover a modality (e.g. fMRI/diffusion), you only want to annotate relevant documents. it's annoying to have to skip irrelevant ones, so ideally you would filter those out using keywords. however, there are also aspects of cobidas that span modalities, so you'd want to annotate all of those documents on another dimension (say: data sharing statements). it's possible that this use case could be handled by symlinks if you thoughtfully create the subsets prior to starting annotation.
finally, in terms of finding relevant papers, i agree this is challenging. perhaps this can be a less controlled process which we facilitate through the python interface, but it may still require some amount of user intervention to create a final list of PMCIDs to include in a project. that list is then added to the project, and a script populates a "view" of labelbuddy files for annotation.
I took a look at the existing overlap between projects (excluding symlinks, so there is actually more overlap in reality). There are 4477 documents in labelbuddy files, but only 3697 of them are unique. This tells me that we are storing way more document data than we need to, and because there are so many documents available, we are likely to end up with non-overlapping annotations unless we are deliberate about it.
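For reference, a small sketch of how such an overlap check could be computed; the `projects/*/documents/*.jsonl` layout and the `metadata["pmcid"]` field are assumptions for illustration:

```python
import json
from collections import Counter
from pathlib import Path


def document_counts(projects_root: Path) -> Counter:
    """Count how many labelbuddy files each PMCID appears in across projects."""
    counts = Counter()
    for jsonl in projects_root.glob("*/documents/*.jsonl"):
        seen_here = set()
        with open(jsonl, encoding="utf-8") as stream:
            for line in stream:
                if line.strip():
                    seen_here.add(json.loads(line)["metadata"]["pmcid"])
        counts.update(seen_here)
    return counts


counts = document_counts(Path("projects"))
print(sum(counts.values()), "documents stored,", len(counts), "unique")
```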
I'm still not 100% sure what we should do here, but my feeling is we're getting too scattered, so either we need more discipline about how we get documents for annotation, or we centralize all documents and check out temporary views for annotation only.
We could also take a more hands-off approach: add a "shared_documents" folder and do a one-time re-organization right now where we group documents into useful subsets.
I agree with everything you said above :) thanks for investigating the overlap between projects. I'm not surprised that annotations are very sparse, and we do need to make an effort to reduce the documents to a subset we realistically hope to annotate completely before starting annotations in a new project. the queries on pubmed tend to give a large number of results, so if every project re-downloads documents there will never be any overlap. I guess we need:
great! I started working on centralizing the annotations by doing something similar to what you outlined. My approach was to:
The above is a one-time migration operation. Still to do:
then in the future if somebody wants to get new documents for a project, there could be a script that:
we could also add a "clean up" script which removes all documents that have no annotations in any project.
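One possible shape for that clean-up script, sketched under the assumption that annotations live in per-project JSON files containing records with a `metadata["pmcid"]` and an `annotations` list, and that shared documents sit in a `shared_documents` pool; the real schema and paths may differ:

```python
import json
from pathlib import Path


def annotated_pmcids(projects_root: Path) -> set:
    """Collect every PMCID that has at least one annotation in some project."""
    pmcids = set()
    for ann_file in projects_root.glob("*/annotations/*.json"):
        for record in json.loads(ann_file.read_text(encoding="utf-8")):
            if record.get("annotations"):
                pmcids.add(record["metadata"]["pmcid"])
    return pmcids


def clean_pool(pool_dir: Path, keep: set) -> None:
    """Rewrite each pool file, dropping documents that were never annotated."""
    for jsonl in pool_dir.glob("*.jsonl"):
        docs = [
            json.loads(line)
            for line in jsonl.read_text(encoding="utf-8").splitlines()
            if line.strip()
        ]
        kept = [doc for doc in docs if doc["metadata"]["pmcid"] in keep]
        jsonl.write_text("".join(json.dumps(doc) + "\n" for doc in kept), encoding="utf-8")


# clean_pool(Path("shared_documents"), annotated_pmcids(Path("projects")))
```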
sorry, I hadn't realized that, so you can ignore most of my comments on the PR -- we don't care too much about the details of the script if we'll only run it once. but of course it should be obvious that moving all documents into one place only needs to be done once 😅
I wonder if we want to keep the pubget output somewhere? if later the article in pubmed is different or not there anymore, we might want to get back the original full xml that produced the thing we annotated. maybe not worth the storage & effort
those you could also discard after populating the labelbuddy database. re: getting new documents, I agree the script you describe would be useful, but for now we can say "come up with your pmcid list" is the current solution. one advantage of not keeping documents without annotations in inactive projects is that if I am lazy I will prefer to use documents that are already available in the repo, and if it only contains documents that are annotated or very likely to be annotated, I will end up annotating the same documents as other projects, which is what we want
My bad, I should have marked the PR as WIP, which it is. Not sure if we even need to keep track of the one-time migration script, then. As for the original pubget output, we could keep it in a folder that is not checked in by default. Yes, agreed: for now we can punt on the search scripts and just tell projects to come up with a pmcid list in a way that best fits their needs. Also agree that it's good to nudge projects to re-use docs. Will add that to the PR.
Closed by #40
Problem:
Key Considerations:
Proposed Solution:
Centralize Source Documents:
Distinguish Document Sources and Annotation Subsets:
Streamlined Access for Annotation Projects:
The main issue with this approach is that it will require quite a bit of work on updating the existing tooling that Jerome wrote, including the website etc.
Alternative Solution: Streamlined Organization without Full Centralization
Maintain Project-Specific Document Sets:
Implement Annotation updating tools:
Implement tools for better merging of documents in existing projects:
Other related issues:
neuroquery/pubget#46