
Topic-first identification of preprints #235

Open
rando2 opened this issue Apr 20, 2020 · 12 comments
Labels
Technical: Technical concerns, enhancements, etc. for the GitHub enthusiasts

Comments

@rando2
Collaborator

rando2 commented Apr 20, 2020

@danich1, who is a PhD student in Casey's lab and an NLP expert, has developed an approach to categorize bioRxiv records using word2vec.

You can find his project here, along with the notebook that shows how to identify related papers.

Note that (I believe) his approach also generates the author type (review vs research), heading (whether it's new results, contradictory results, etc.), and category (field) labels by examining the bioRxiv records. This seems like a way more elegant solution to what I was trying to do in #144, and David has very generously offered to help us adapt his approach for this project!

I wanted to loop in @rdvelazquez and @agitter, who have been leading the preprint tracking. I had a few questions in an issue, but assuming we can generate a dataset similar to what David is working from, I think David's workflow and Ryan's workflow will integrate quite nicely!
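For anyone unfamiliar with the technique: the general shape of a word2vec-based related-papers search is to train embeddings on the corpus, average word vectors into a single vector per document, and rank documents by cosine similarity. A minimal sketch (illustrative, not David's actual code; it assumes the gensim 4 API and a pre-tokenized corpus):

```python
import numpy as np
from gensim.models import Word2Vec

# Toy corpus: one token list per paper; the real pipeline would
# tokenize abstracts or full text.
docs = [["coronavirus", "spike", "protein", "binding"],
        ["vaccine", "trial", "phase", "immunogenicity"]]
model = Word2Vec(sentences=docs, vector_size=100, min_count=1, seed=1)

def embed(tokens):
    """Average the word vectors of in-vocabulary tokens."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

# Cosine similarity between document vectors ranks "related papers".
a, b = embed(docs[0]), embed(docs[1])
similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```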

rando2 added the Technical label Apr 20, 2020
@rando2
Collaborator Author

rando2 commented Apr 21, 2020

I'm putting this announcement of bioRxiv's XML repository here in case it's useful later: http://connect.biorxiv.org/news/2020/04/18/tdm

@danich1
Contributor

danich1 commented Apr 21, 2020

Oh, awesome find @rando2. Looks like bioRxiv got their XML dump ready for the public. As answered in greenelab/annorxiver#7, I couldn't share the dump because the bioRxiv group didn't want me to share it with anybody until they were ready to go public.

bioRxiv is one avenue to go down, but it might be worth taking a look at this dataset. AllenAI constructed their own COVID-19-specific dump in JSON format. They are also hosting a Kaggle competition using that dataset (the kernels have cool results, including code that does embeddings as previously discussed). Both are avenues to consider, but if my code ends up being used, I can adapt it to AllenAI's dataset. It will be a simple fix.

@rando2
Collaborator Author

rando2 commented Apr 23, 2020

Thank you so much for this @danich1! So I looked into the structure of that Allen AI dataset you suggested and it seems pretty straightforward to combine the metadata file with the output from what Ryan has been building and use it in your approach.

The Allen AI dataset seems to include DOIs for every record (at least at first glance) and abstracts as well. For the full text, I believe the has_full_text, full_text_file, and sha fields will make it possible to locate the appropriate json file in the Allen AI dataset (in pseudocode, if has_full_text == True, filename = full_text_file/sha.json in the dataset; a sketch of this lookup follows below).
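A minimal sketch of that lookup in pandas (the paths are assumptions; the column names are the ones described above, from the April 2020 CORD-19 release):

```python
import pandas as pd

metadata = pd.read_csv("cord19/metadata.csv")

def full_text_path(row):
    """Resolve a metadata row to its full-text JSON file, following the
    has_full_text/full_text_file/sha convention described above."""
    if row["has_full_text"] != True or pd.isna(row["sha"]):
        return None
    # sha can hold several ';'-separated hashes; take the first one.
    sha = str(row["sha"]).split(";")[0].strip()
    return f"cord19/{row['full_text_file']}/{sha}.json"

paths = metadata.apply(full_text_path, axis=1)
```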

I put code that combines the two datasets here in case it's helpful.

Two concerns:

  1. @rdvelazquez you probably already know this but the output directory is empty right now, so I used an old version of "sources_cross_reference.tsv" that I had built locally to test this.
  2. Not all of the articles we're tracking through Ryan's approach have DOIs, but these seem mostly to be things like textbooks or websites. Given that part of the goal here is just to build reading lists of suggested articles, I think that's okay? But I'm definitely happy to hear what others think!

@danich1 what can I do to help with integrating your pipeline and Ryan's tracking? Thank you so much for lending your talents to this project!

@rdvelazquez
Collaborator

> @rdvelazquez you probably already know this but the output directory is empty right now, so I used an old version of "sources_cross_reference.tsv" that I had built locally to test this.

The output directory of the master branch should be empty (except for the readme). The output branch should have the sources_cross_reference.tsv file though.

@rando2
Collaborator Author

rando2 commented Apr 24, 2020

Thanks @rdvelazquez, I see it there! If it would be helpful to see the union of the AIA dataset and the most recent sources_cross_reference.tsv, I can generate an updated version.

@rdvelazquez
Collaborator

rdvelazquez commented Apr 24, 2020

I was actually just looking at that now (I just pulled down your allenAI branch). It looks good!

The AIA dataset seems to have a lot of tangentially related papers (something like 35,000+ of the papers are from before 2020)... We can leave all of those in our analyses unless (a) they start to bog down performance and/or (b) the more relevant, newer, COVID-19-specific papers get lost among all the others. Also, removing the marginally related papers before clustering might give better clustering results for the papers that we really care about.
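If we do decide to filter, it should be a one-liner on the metadata; a sketch, assuming the publish_time column from the CORD-19 release:

```python
import pandas as pd

metadata = pd.read_csv("cord19/metadata.csv")
# Coerce malformed dates to NaT, then keep papers from 2020 onward.
dates = pd.to_datetime(metadata["publish_time"], errors="coerce")
recent = metadata[dates >= "2020-01-01"]
```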

@rdvelazquez
Collaborator

P.S. Some of the papers seemed to get messed up (author names in all the fields) when I viewed AllenAImetadata-tracked.csv in Excel. It worked when I changed the output to TSV, so we may want to consider that moving forward.
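For reference, the delimiter swap is a one-line change in pandas; a sketch with a hypothetical dataframe standing in for the notebook's output:

```python
import pandas as pd

# Hypothetical stand-in for the combined metadata table.
combined = pd.DataFrame({
    "doi": ["10.1101/2020.04.20.000000"],
    "authors": ["Doe, J.; Roe, R."],
})
# Tab-separated output keeps Excel from splitting comma-laden
# author lists across columns.
combined.to_csv("AllenAImetadata-tracked.tsv", sep="\t", index=False)
```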

@rando2
Collaborator Author

rando2 commented Apr 24, 2020

@rdvelazquez I agree! I was assuming that, from an NLP perspective, David might want the option to drop papers rather than have too little information (especially since the mechanisms of the drugs, for example, were probably not published recently). But for the user-facing list, we will almost certainly want to filter by date so we don't end up with huge lists! This dataset does seem a lot better organized than the CDC list I was originally working from, and it seems to have a lot fewer character-set issues!

Also, thank you for the heads up about the TSV!

@danich1
Contributor

danich1 commented Apr 24, 2020

@rando2 So my pipeline should be tangential to Ryan's tracking. Provided I'm following the chain of messages correctly, the only thing my pipeline needs is all the papers from both resources in a unified format (e.g. JSON or XML). Once that has been gathered, the biggest change needed will be to update the paths in the notebook. If the papers are in JSON format, then more changes will be needed; however, we can cross that bridge if it arises.

@rando2
Collaborator Author

rando2 commented Apr 24, 2020

@danich1 Excellent! So the big dataset (Allen AI) is in JSON format. Assuming it sounds good to @rdvelazquez, this weekend I will look into setting up the following steps (a rough sketch of steps 1 and 3 follows the list):

  1. ID any DOIs that we're tracking here (Ryan's dataset) that aren't in the Allen AI dataset (most likely "aren't yet," since AIA is updated weekly and Ryan's is updated with CI)
  2. Use the bioRxiv XML repository to pull the records for those papers (I'm not sure what we should do if they're not in bioRxiv/medRxiv, but I imagine most will be)
  3. Convert the XML to JSON so the records match the data from Allen AI
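A rough sketch of steps 1 and 3, with assumed file names and JATS-style tag names that I haven't verified against the bioRxiv dump:

```python
import json
import xml.etree.ElementTree as ET

import pandas as pd

# Step 1: DOIs tracked here but absent from the Allen AI metadata.
cord = pd.read_csv("cord19/metadata.csv")
tracked = pd.read_csv("sources_cross_reference.tsv", sep="\t")

def norm(doi):
    return str(doi).strip().lower()

cord_dois = {norm(d) for d in cord["doi"].dropna()}
tracked_dois = tracked["doi"].fillna("").map(norm)
missing = tracked[(tracked_dois != "") & ~tracked_dois.isin(cord_dois)]

# Step 3: flatten a bioRxiv XML record into a CORD-19-style JSON dict.
def text_of(root, tag):
    node = root.find(f".//{tag}")
    return "".join(node.itertext()) if node is not None else ""

def xml_to_json(xml_path):
    root = ET.parse(xml_path).getroot()
    return json.dumps({
        "title": text_of(root, "article-title"),
        "abstract": text_of(root, "abstract"),
    })
```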

Thank you so much everyone, this is exciting!

@rdvelazquez
Collaborator

Sounds good to me! Let me know if you run into any problems, and I'll be glad to help out.

@rando2
Collaborator Author

rando2 commented May 18, 2020

From @SiminaB here is a related effort: https://covidscholar.org
