Introduction
PubMed is a free search engine that primarily accesses the MEDLINE database of references and abstracts on life sciences and biomedical topics.
A simple search for 'covid-19' on PubMed returned 156,871 citations, an impossible number of papers for anyone to sift through.
I set out to see whether I could gain some understanding of trends in the literature using natural language processing.
The data were downloaded from PubMed after searching for COVID-19.
Libraries used
- nltk
- pandas
- wordcloud
- sklearn
Approaches:
- Cleaning the data: records with null values and abstracts shorter than 50 words (not true abstracts, per visual inspection) are removed (see the preprocessing sketch after this list)
- Removing stopwords
- Performing lemmatization
- Ranking most frequently used words
- Visualizing most used words using WordCloud
- Predicting document labels to identify the types of studies performed
- Topic Modeling using Latent Dirichlet Allocation technique
- Recommending collections of articles based on the top words for each topic
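The cleaning, stopword-removal, and lemmatization steps might look roughly like this sketch; the file name and the `Abstract` column name are assumptions about the PubMed export, not the project's actual code:

```python
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

# Load the PubMed export; file name and 'Abstract' column are assumptions.
df = pd.read_csv("pubmed_covid19.csv")

# Drop records with null abstracts and abstracts shorter than 50 words,
# which visual inspection suggested are not true abstracts.
df = df.dropna(subset=["Abstract"])
df = df[df["Abstract"].str.split().str.len() >= 50]

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Lowercase, keep alphabetic tokens, drop stopwords, then lemmatize.
    tokens = [t for t in nltk.word_tokenize(text.lower()) if t.isalpha()]
    return [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]

df["tokens"] = df["Abstract"].apply(preprocess)
```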
Summary
- In the first analysis, I imported CSV files from PubMed containing the dumped data for 2019-2020. Three types of analysis were performed to gain some insight into the COVID-19 literature.
- The first attempt was to examine the top words in the whole data set to see what insights one might infer from them. I found that the top words give a sense of what was discussed, but there is a limit to the depth of information they can provide.
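As a sketch, the top-word ranking and the WordCloud view can be produced from the `tokens` column built above (assumed names, not the project's exact code; matplotlib is used here only for display):

```python
from collections import Counter

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Rank word frequencies across all abstracts.
counts = Counter(tok for tokens in df["tokens"] for tok in tokens)
print(counts.most_common(20))

# Visualize the same frequencies as a word cloud.
wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(counts)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```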
- Next I took advantage of the train, test, and validation data sets, which have prelabels for the type of study in each abstract. I trained an SVC and predicted the labels for the test and validation data sets. Records in the training set have one or more labels. The model made about 44% correct predictions if a prediction is counted as correct whenever the predicted label is one of the labels in the true set. A detailed examination of the results showed that this approach was not desirable, since the model only ever predicted a single label. To do this properly, the labels should be reexamined and cleaned.
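A minimal sketch of this step with scikit-learn, using toy stand-ins for the prelabeled splits (the column names `Abstract`, `Label`, and `LabelSet` are hypothetical) and the lenient scoring described above:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for the real prelabeled splits; in the project these come
# from the downloaded PubMed data (column names are hypothetical).
train = pd.DataFrame({
    "Abstract": ["randomized controlled trial of an antiviral drug",
                 "retrospective cohort of hospitalized patients",
                 "systematic review of mask effectiveness studies"],
    "Label": ["trial", "cohort", "review"],
})
test = pd.DataFrame({
    "Abstract": ["double blind placebo controlled trial of a vaccine"],
    "LabelSet": [{"trial", "vaccine study"}],  # records may carry several labels
})

pipe = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
pipe.fit(train["Abstract"], train["Label"])

pred = pipe.predict(test["Abstract"])
# Lenient scoring: a prediction counts as correct if the single predicted
# label appears anywhere in the record's true label set.
acc = sum(p in s for p, s in zip(pred, test["LabelSet"])) / len(pred)
print(pred, acc)
```

Note that LinearSVC predicts exactly one class per document, which matches the single-label behavior observed above; a genuinely multi-label setup would need something like `MultiLabelBinarizer` with `OneVsRestClassifier`.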
- Thirdly, I performed topic modeling using the Latent Dirichlet Allocation (LDA) method. Since the original data sets have eight different labels, I first performed the analysis with 8 topics, then also tried 15 topics. Based on the top ten words for the most discussed topics, I concluded that 8 topics is sufficient for teasing out the differences between topics. Based on this analysis, it is possible to pick out papers that discuss certain topics.
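A minimal sketch of the LDA step with scikit-learn, again assuming a `df` with an `Abstract` column; it prints the top ten words per topic, which is the kind of output the conclusion above rests on:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# LDA works on raw term counts rather than TF-IDF weights.
vec = CountVectorizer(stop_words="english", max_df=0.95, min_df=2)
X = vec.fit_transform(df["Abstract"])

lda = LatentDirichletAllocation(n_components=8, random_state=0)
doc_topics = lda.fit_transform(X)  # document-by-topic weight matrix

# Print the ten highest-weighted words for each topic.
words = vec.get_feature_names_out()
for i, comp in enumerate(lda.components_):
    top = [words[j] for j in comp.argsort()[-10:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")

# Papers that discuss topic i can be picked out by sorting doc_topics[:, i].
```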
- Based on the results, the WordNet Lemmatizer did not do a good job reducing some words to their base forms.
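A likely cause is that WordNetLemmatizer treats every token as a noun unless a part-of-speech tag is passed in, so verb and adjective forms pass through unchanged; this is standard nltk behavior rather than something specific to this data set:

```python
from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()
print(lem.lemmatize("studies"))           # 'study'   (nouns are handled)
print(lem.lemmatize("running"))           # 'running' (default pos='n' leaves verbs alone)
print(lem.lemmatize("running", pos="v"))  # 'run'     (correct once the POS is supplied)
```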
- In the second analysis, I downloaded more recent data (limited to 10,000 records) from PubMed on COVID-19 in a text format, then performed word frequency and topic analysis.
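If the text export is in PubMed's MEDLINE format, where each field starts with a tag such as `AB  - ` for the abstract and continuation lines are indented, the abstracts might be pulled out with a small parser like this sketch (the format assumption and the file name should be checked against the actual download):

```python
def parse_medline_abstracts(path):
    # Collect 'AB  - ' fields from a MEDLINE-format text export; indented
    # lines continue the current field. This assumes the standard layout.
    abstracts, current, in_ab = [], [], False
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("AB  - "):
                current, in_ab = [line[6:].strip()], True
            elif in_ab and line.startswith("      "):
                current.append(line.strip())
            elif in_ab:
                abstracts.append(" ".join(current))
                current, in_ab = [], False
    if in_ab:  # file may end inside an abstract
        abstracts.append(" ".join(current))
    return abstracts

abstracts = parse_medline_abstracts("pubmed-covid19.txt")  # file name is an assumption
print(len(abstracts))
```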