Introduction
PubMed is a free search engine that primarily accesses the MEDLINE database of references and abstracts on life sciences and biomedical topics.
A simple search for 'covid-19' on PubMed returned 156,871 citations, an impossible number of papers for anyone to sift through.
I set out to see whether I could gain some understanding of trends in the literature using natural language processing.
The data were downloaded from PubMed after searching for COVID-19.
Libraries used
- nltk
- pandas
- wordcloud
- sklearn
Approaches:
- Cleaning the data: records with null values and abstracts shorter than 50 words (not true abstracts, per visual inspection) are removed (see the preprocessing sketch after this list)
- Removing stopwords
- Performing lemmatization
- Ranking most frequently used words
- Visualizing most used words using WordCloud
- Predicting document labels to identify the types of studies performed
- Topic Modeling using Latent Dirichlet Allocation technique
- Recommending collections of articles based on the top words for each topic
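The cleaning, stopword-removal, and lemmatization steps might look roughly like this sketch; the file name and the `Abstract` column name are assumptions about the PubMed export, not the project's actual code:

```python
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

# Load the PubMed export; file name and 'Abstract' column are assumptions.
df = pd.read_csv("pubmed_covid19.csv")

# Drop records with null abstracts and abstracts shorter than 50 words,
# which visual inspection suggested are not true abstracts.
df = df.dropna(subset=["Abstract"])
df = df[df["Abstract"].str.split().str.len() >= 50]

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Lowercase, keep alphabetic tokens, drop stopwords, then lemmatize.
    tokens = [t for t in nltk.word_tokenize(text.lower()) if t.isalpha()]
    return [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]

df["tokens"] = df["Abstract"].apply(preprocess)
```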
Summary
- In the first analysis, I imported CSV files from PubMed containing the dumped data for 2019-2020. Three types of analysis were performed to gain some insight into the COVID-19 literature.
- The first attempt was to examine the top words in the whole data set to see what insights one might infer from them. I found that the top words give a sense of what was discussed, but there is a limit to the depth of information they can provide.
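As a sketch, the top-word ranking and the WordCloud view can be produced from the `tokens` column built above (assumed names, not the project's exact code; matplotlib is used here only for display):

```python
from collections import Counter

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Rank word frequencies across all abstracts.
counts = Counter(tok for tokens in df["tokens"] for tok in tokens)
print(counts.most_common(20))

# Visualize the same frequencies as a word cloud.
wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(counts)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```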
- Next I took advantage of the train, test, and validation data sets, which have prelabels for the type of study in each abstract. I trained an SVC and predicted the labels for the test and validation data sets. Records in the training set have one or more labels. The model made about 44% correct predictions if a prediction is counted as correct whenever the predicted label is one of the labels in the true set. A detailed examination of the results showed that this approach was not desirable, since the model only ever predicted a single label. To do this properly, the labels should be reexamined and cleaned.
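A minimal sketch of this step with scikit-learn, using toy stand-ins for the prelabeled splits (the column names `Abstract`, `Label`, and `LabelSet` are hypothetical) and the lenient scoring described above:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for the real prelabeled splits; in the project these come
# from the downloaded PubMed data (column names are hypothetical).
train = pd.DataFrame({
    "Abstract": ["randomized controlled trial of an antiviral drug",
                 "retrospective cohort of hospitalized patients",
                 "systematic review of mask effectiveness studies"],
    "Label": ["trial", "cohort", "review"],
})
test = pd.DataFrame({
    "Abstract": ["double blind placebo controlled trial of a vaccine"],
    "LabelSet": [{"trial", "vaccine study"}],  # records may carry several labels
})

pipe = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
pipe.fit(train["Abstract"], train["Label"])

pred = pipe.predict(test["Abstract"])
# Lenient scoring: a prediction counts as correct if the single predicted
# label appears anywhere in the record's true label set.
acc = sum(p in s for p, s in zip(pred, test["LabelSet"])) / len(pred)
print(pred, acc)
```

Note that LinearSVC predicts exactly one class per document, which matches the single-label behavior observed above; a genuinely multi-label setup would need something like `MultiLabelBinarizer` with `OneVsRestClassifier`.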
- Thirdly, I performed topic modeling using the Latent Dirichlet Allocation (LDA) method. Since the original data sets have eight different labels, I first performed the analysis with 8 topics, then also tried 15 topics. Based on the top ten words for the most discussed topics, I concluded that 8 topics is sufficient for teasing out the differences between topics. Based on this analysis, it is possible to pick out papers that discuss certain topics.
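A minimal sketch of the LDA step with scikit-learn, again assuming a `df` with an `Abstract` column; it prints the top ten words per topic, which is the kind of output the conclusion above rests on:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# LDA works on raw term counts rather than TF-IDF weights.
vec = CountVectorizer(stop_words="english", max_df=0.95, min_df=2)
X = vec.fit_transform(df["Abstract"])

lda = LatentDirichletAllocation(n_components=8, random_state=0)
doc_topics = lda.fit_transform(X)  # document-by-topic weight matrix

# Print the ten highest-weighted words for each topic.
words = vec.get_feature_names_out()
for i, comp in enumerate(lda.components_):
    top = [words[j] for j in comp.argsort()[-10:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")

# Papers that discuss topic i can be picked out by sorting doc_topics[:, i].
```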
- Based on the results, the WordNet Lemmatizer did not do a good job reducing some words to their base forms.
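A likely cause is that WordNetLemmatizer treats every token as a noun unless a part-of-speech tag is passed in, so verb and adjective forms pass through unchanged; this is standard nltk behavior rather than something specific to this data set:

```python
from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()
print(lem.lemmatize("studies"))           # 'study'   (nouns are handled)
print(lem.lemmatize("running"))           # 'running' (default pos='n' leaves verbs alone)
print(lem.lemmatize("running", pos="v"))  # 'run'     (correct once the POS is supplied)
```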
- In the second analysis, I downloaded more recent data (limited to 10,000 records) from PubMed on COVID-19 in a text format, then performed word frequency and topic analysis.
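If the text export is in PubMed's MEDLINE format, where each field starts with a tag such as `AB  - ` for the abstract and continuation lines are indented, the abstracts might be pulled out with a small parser like this sketch (the format assumption and the file name should be checked against the actual download):

```python
def parse_medline_abstracts(path):
    # Collect 'AB  - ' fields from a MEDLINE-format text export; indented
    # lines continue the current field. This assumes the standard layout.
    abstracts, current, in_ab = [], [], False
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("AB  - "):
                current, in_ab = [line[6:].strip()], True
            elif in_ab and line.startswith("      "):
                current.append(line.strip())
            elif in_ab:
                abstracts.append(" ".join(current))
                current, in_ab = [], False
    if in_ab:  # file may end inside an abstract
        abstracts.append(" ".join(current))
    return abstracts

abstracts = parse_medline_abstracts("pubmed-covid19.txt")  # file name is an assumption
print(len(abstracts))
```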