Pull together the normalization information in a dataset ⛙ #3

sdruskat · 2021-03-31T17:57:11Z

What do we have?

A set of normalized software mentions (see "Normalize software mentions (a.k.a. data cleaning 🧹)", #1)

The issue

We need some sort of dataset to count mentions according to #2.

What do we really need?

A dataset to count the mentions on

There are several ways this could look:

A list of all normalized mentions with info on which paper they appeared in
An enriched version of CORD-19 with annotations of the normalized mention per software mention per paper, i.e., the information that Software1 and Software2 are both mentioned in this paper, even if Software1 was actually mentioned as "software one" or "SW 1", and perhaps the count of each mention per paper
A new dataset which reuses information from CORD-19 but presents it in a cleaned-up fashion, and possibly some other format
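
To make the first two options more concrete, here is a minimal sketch of counting normalized mentions per paper. Everything in it is an assumption for illustration: the mapping table, the paper IDs, and the function names are hypothetical and do not reflect the actual CORD-19 schema or the output of #1.

```python
from collections import Counter, defaultdict

# Hypothetical normalization table: raw mention string -> normalized name.
NORMALIZED = {
    "software one": "Software1",
    "SW 1": "Software1",
    "Software1": "Software1",
    "Software2": "Software2",
}

# Hypothetical raw mentions as (paper_id, raw_mention) pairs.
RAW_MENTIONS = [
    ("paper-a", "software one"),
    ("paper-a", "SW 1"),
    ("paper-a", "Software2"),
    ("paper-b", "Software1"),
]


def count_mentions_per_paper(raw_mentions, normalized):
    """Return {paper_id: Counter({normalized_name: count})}."""
    per_paper = defaultdict(Counter)
    for paper_id, raw in raw_mentions:
        # Fall back to the raw string when no normalization is known.
        name = normalized.get(raw, raw)
        per_paper[paper_id][name] += 1
    return dict(per_paper)


def count_papers_per_software(per_paper):
    """Return Counter({normalized_name: number of papers mentioning it})."""
    totals = Counter()
    for counts in per_paper.values():
        # Each paper contributes at most 1 per software name.
        totals.update(counts.keys())
    return totals


per_paper = count_mentions_per_paper(RAW_MENTIONS, NORMALIZED)
papers_per_software = count_papers_per_software(per_paper)
```

Here `per_paper` corresponds to the "enriched" view (counts per paper), while `papers_per_software` is one possible basis for the counting asked for in #2.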

How can we achieve this?

Ideas welcome (Jupyter Notebook perhaps?)


olexandr-konovalov · 2021-03-31T20:45:34Z

+1 for Jupyter. We can have a fully automated and reproducible analysis that downloads the CSV file (or uses a refined dataset kept in the repository) and can be re-run on Binder: https://github.com/rse-standrewscs/python-binder-template

olexandr-konovalov · 2021-03-31T20:47:53Z

Still, some code should be in .py files, which are easier to keep under version control, test, etc.

Obligatory reading is https://doi.org/10.1371/journal.pcbi.1007007

There is also a tool for diffing and merging Jupyter notebooks: https://nbdime.readthedocs.io/
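
As a rough sketch of the split suggested above (logic in a .py file, analysis in the notebook), the loading step could live in a small module with its own test. The module name `mentions.py` and the column names `paper_id` and `software` are assumptions, not the schema of the actual CSV file.

```python
# mentions.py (hypothetical module name)
import csv
import io


def load_mentions(csv_file):
    """Read (paper_id, normalized_name) pairs from an open CSV file.

    The column names "paper_id" and "software" are assumptions,
    not the schema of the actual dataset.
    """
    reader = csv.DictReader(csv_file)
    return [(row["paper_id"], row["software"].strip()) for row in reader]


def test_load_mentions():
    # A tiny in-memory CSV stands in for the real file.
    sample = io.StringIO("paper_id,software\np1,Software1\np2, Software2 \n")
    assert load_mentions(sample) == [("p1", "Software1"), ("p2", "Software2")]
```

The notebook would then only need `from mentions import load_mentions`, which keeps the notebook diff small and lets the function be tested outside Jupyter.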

sdruskat added the "required" label (something that needs to be done to make the hack successful) on Mar 31, 2021

sdruskat added this to the "Habeas useful corpus" milestone on Mar 31, 2021

This was referenced Mar 31, 2021

Identify the n most popular packages for COVID-19-related research as seed data for building an enriched dataset 🌱 #4

Open

Find out if the seed data packages are publicly available, and annotate them respectively 🔗 #5

Open
