
Topic-first identification of preprints #235

Open
rando2 opened this issue Apr 20, 2020 · 12 comments
Labels
Technical: Technical concerns, enhancements, etc. for the GitHub enthusiasts

Comments

@rando2
Collaborator

rando2 commented Apr 20, 2020

@danich1, who is a PhD student in Casey's lab and an NLP expert, has developed an approach to categorize bioRxiv records using word2vec.

You can find his project here, along with the notebook that shows how to identify related papers.

Note that (I believe) his approach also generates the author type (review vs research), heading (whether it's new results, contradictory results, etc.), and category (field) labels by examining the bioRxiv records. This seems like a way more elegant solution to what I was trying to do in #144, and David has very generously offered to help us adapt his approach for this project!

I wanted to loop in @rdvelazquez and @agitter, who have been leading the preprint tracking. I had a few questions in an issue, but assuming we can generate a dataset similar to what David is working from, I think David's workflow and Ryan's workflow will integrate quite nicely!
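For anyone unfamiliar with the technique: the general shape of a word2vec-based related-papers search is to train embeddings on the corpus, average word vectors into a single vector per document, and rank documents by cosine similarity. A minimal sketch (illustrative, not David's actual code; it assumes the gensim 4 API and a pre-tokenized corpus):

```python
import numpy as np
from gensim.models import Word2Vec

# Toy corpus: one token list per paper; the real pipeline would
# tokenize abstracts or full text.
docs = [["coronavirus", "spike", "protein", "binding"],
        ["vaccine", "trial", "phase", "immunogenicity"]]
model = Word2Vec(sentences=docs, vector_size=100, min_count=1, seed=1)

def embed(tokens):
    """Average the word vectors of in-vocabulary tokens."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

# Cosine similarity between document vectors ranks "related papers".
a, b = embed(docs[0]), embed(docs[1])
similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```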

rando2 added the Technical label Apr 20, 2020
@rando2
Collaborator Author

rando2 commented Apr 21, 2020

I'm putting this announcement of bioRxiv's XML repository here in case it's useful later: http://connect.biorxiv.org/news/2020/04/18/tdm

@danich1
Contributor

danich1 commented Apr 21, 2020

Oh, awesome find @rando2. Looks like bioRxiv got their XML dump ready for the public. As answered in greenelab/annorxiver#7, I couldn't share the dump because the bioRxiv group didn't want me to share it with anybody until they were ready to go public.

bioRxiv is one avenue to go down, but it might be worth taking a look at this dataset. AllenAI constructed their own COVID-19-specific dump in JSON format. They are also hosting a Kaggle competition using that dataset (the kernels have cool results, including code that does embeddings as previously discussed). Both are avenues to consider, but if my code ends up being used, I can adapt it to AllenAI's dataset. It will be a simple fix.

@rando2
Collaborator Author

rando2 commented Apr 23, 2020

Thank you so much for this @danich1! So I looked into the structure of that Allen AI dataset you suggested and it seems pretty straightforward to combine the metadata file with the output from what Ryan has been building and use it in your approach.

The Allen AI dataset seems to include DOIs for every record (at least at first glance) and abstracts as well. For the full text, I believe the has_full_text, full_text_file, and sha fields will make it possible to locate the appropriate json file in the Allen AI dataset (in pseudocode, if has_full_text == True, filename = full_text_file/sha.json in the dataset; a sketch of this lookup follows below).
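A minimal sketch of that lookup in pandas (the paths are assumptions; the column names are the ones described above, from the April 2020 CORD-19 release):

```python
import pandas as pd

metadata = pd.read_csv("cord19/metadata.csv")

def full_text_path(row):
    """Resolve a metadata row to its full-text JSON file, following the
    has_full_text/full_text_file/sha convention described above."""
    if row["has_full_text"] != True or pd.isna(row["sha"]):
        return None
    # sha can hold several ';'-separated hashes; take the first one.
    sha = str(row["sha"]).split(";")[0].strip()
    return f"cord19/{row['full_text_file']}/{sha}.json"

paths = metadata.apply(full_text_path, axis=1)
```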

I put code that combines the two datasets here in case it's helpful.

Two concerns:

  1. @rdvelazquez you probably already know this but the output directory is empty right now, so I used an old version of "sources_cross_reference.tsv" that I had built locally to test this.
  2. Not all of the articles we're tracking through Ryan's approach have DOIs, but these seem mostly to be things like textbooks or websites. Given that part of the goal here is just to build reading lists of suggested articles, I think that's okay? But I'm definitely happy to hear what others think!

@danich1 what can I do to help with integrating your pipeline and Ryan's tracking? Thank you so much for lending your talents to this project!

@rdvelazquez
Collaborator

> @rdvelazquez you probably already know this but the output directory is empty right now, so I used an old version of "sources_cross_reference.tsv" that I had built locally to test this.

The output directory of the master branch should be empty (except for the readme). The output branch should have the sources_cross_reference.tsv file though.

@rando2
Collaborator Author

rando2 commented Apr 24, 2020

Thanks @rdvelazquez, I see it there! If it would be helpful to see the union of the AIA dataset and the most recent sources_cross_reference.tsv, I can generate an updated version.

@rdvelazquez
Collaborator

rdvelazquez commented Apr 24, 2020

I was actually just looking at that now (I just pulled down your allenAI branch). It looks good!

The AIA dataset seems to have a lot of tangentially related papers (something like 35,000+ of the papers are from before 2020)... We can leave all of those in our analyses unless (a) they start to bog down performance and/or (b) the more relevant, newer, COVID-19-specific papers get lost among all the others. Also, removing the marginally related papers before clustering might give better clustering results for the papers that we really care about.
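If we do decide to filter, it should be a one-liner on the metadata; a sketch, assuming the publish_time column from the CORD-19 release:

```python
import pandas as pd

metadata = pd.read_csv("cord19/metadata.csv")
# Coerce malformed dates to NaT, then keep papers from 2020 onward.
dates = pd.to_datetime(metadata["publish_time"], errors="coerce")
recent = metadata[dates >= "2020-01-01"]
```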

@rdvelazquez
Collaborator

P.S. Some of the papers seemed to get messed up (author names in all the fields) when I viewed AllenAImetadata-tracked.csv in Excel. It worked when I changed the output to TSV, so we may want to consider that moving forward.
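For reference, the delimiter swap is a one-line change in pandas; a sketch with a hypothetical dataframe standing in for the notebook's output:

```python
import pandas as pd

# Hypothetical stand-in for the combined metadata table.
combined = pd.DataFrame({
    "doi": ["10.1101/2020.04.20.000000"],
    "authors": ["Doe, J.; Roe, R."],
})
# Tab-separated output keeps Excel from splitting comma-laden
# author lists across columns.
combined.to_csv("AllenAImetadata-tracked.tsv", sep="\t", index=False)
```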

@rando2
Collaborator Author

rando2 commented Apr 24, 2020

@rdvelazquez I agree! I was assuming that, from an NLP perspective, David might want the option to drop papers rather than have too little information (especially since the mechanisms of the drugs, for example, were probably not published recently). But for the user-facing list, we will almost certainly want to filter by date so we don't end up with huge lists! This dataset does seem a lot better organized than the CDC list I was originally working from, and it seems to have a lot fewer character-set issues!

Also, thank you for the heads up about the TSV!

@danich1
Contributor

danich1 commented Apr 24, 2020

@rando2 So my pipeline should be tangential to Ryan's tracking. Provided I'm following the chain of messages correctly, the only thing my pipeline needs is all the papers from both resources in a unified format (e.g. JSON or XML). Once that has been gathered, the biggest change needed will be to update the paths in the notebook. If the papers are in JSON format, then more changes will be needed; however, we can cross that bridge if it arises.

@rando2
Collaborator Author

rando2 commented Apr 24, 2020

@danich1 Excellent! So the big dataset (Allen AI) is in JSON format. Assuming it sounds good to @rdvelazquez, this weekend I will look into setting up the following steps (a rough sketch of steps 1 and 3 follows the list):

  1. ID any DOIs that we're tracking here (Ryan's dataset) that aren't in the Allen AI dataset (most likely "aren't yet," since AIA is updated weekly and Ryan's is updated with CI)
  2. Use the bioRxiv XML repository to pull the records for those papers (I'm not sure what we should do if they're not in bioRxiv/medRxiv, but I imagine most will be)
  3. Convert the XML to JSON so the records match the data from Allen AI
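A rough sketch of steps 1 and 3, with assumed file names and JATS-style tag names that I haven't verified against the bioRxiv dump:

```python
import json
import xml.etree.ElementTree as ET

import pandas as pd

# Step 1: DOIs tracked here but absent from the Allen AI metadata.
cord = pd.read_csv("cord19/metadata.csv")
tracked = pd.read_csv("sources_cross_reference.tsv", sep="\t")

def norm(doi):
    return str(doi).strip().lower()

cord_dois = {norm(d) for d in cord["doi"].dropna()}
tracked_dois = tracked["doi"].fillna("").map(norm)
missing = tracked[(tracked_dois != "") & ~tracked_dois.isin(cord_dois)]

# Step 3: flatten a bioRxiv XML record into a CORD-19-style JSON dict.
def text_of(root, tag):
    node = root.find(f".//{tag}")
    return "".join(node.itertext()) if node is not None else ""

def xml_to_json(xml_path):
    root = ET.parse(xml_path).getroot()
    return json.dumps({
        "title": text_of(root, "article-title"),
        "abstract": text_of(root, "abstract"),
    })
```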

Thank you so much everyone, this is exciting!

@rdvelazquez
Collaborator

Sounds good to me! Let me know if you run into any problems, and I'll be glad to help out.

@rando2
Collaborator Author

rando2 commented May 18, 2020

From @SiminaB here is a related effort: https://covidscholar.org
