Explain how literatures are classified. Classify literature according to model organisms #26

Open
goldturtle opened this issue Jun 18, 2019 · 10 comments

Comments

@goldturtle
Contributor

Transferring from email thread to this ticket.

Michael wrote on 2016-09-15:

There was a request to classify the PMCOA corpus according to more
topics than just Biology, Medicine, Genetics, Genomics etc, i.e.,
according to model organisms. While ultimately we will use an SVM for
that, right now I am scanning subject, title and journal name for
keywords to classify the corpus and make what's known in TPC as
literatures. So I would appreciate it if you could contribute keywords for
those fields (subject, article title and journal name) for the following
organisms:

drosophila
C. elegans
arabidopsis
mouse
zebrafish

If you can think of any other model organism that would be of interest
to the curation and biomedical community, let me know.
Michael.

@goldturtle
Contributor Author

On 2016-09-22 Chris wrote:

Hi Michael,

I'm not exactly sure what you're looking for, but if you're looking for keywords to pull out papers for those species I guess I would suggest the following:

for drosophila: Drosophila, Drosophila melanogaster, D. melanogaster, fruit fly

for C. elegans: C. elegans, Caenorhabditis elegans, Caenorhabditis

for arabidopsis: Arabidopsis, A. thaliana, Arabidopsis thaliana

for mouse: Mus musculus, musculus, murine, mouse, mice

for zebrafish: zebrafish, Danio rerio, D. rerio

As for other model organisms of interest to the curation and biomedical community:

budding yeast: Saccharomyces cerevisiae

fission yeast: Schizosaccharomyces pombe

slime mold: Dictyostelium discoideum

Norway brown rat: Rattus norvegicus

black rat: Rattus rattus

sea squirt: Ciona intestinalis

African clawed frog: Xenopus laevis

Western clawed frog: Xenopus tropicalis

Bacteria: Escherichia coli (E. coli), Bacillus subtilis (B. subtilis)

There's an extensive list on Wikipedia:

https://en.wikipedia.org/wiki/List_of_model_organisms

I hope that helps,

Chris
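
A minimal sketch of the keyword scan discussed above, using the keywords suggested in this comment. The field names (`title`, `subject`, `journal`) and the function names are assumptions made for illustration; this is not the TPC implementation. Matching is case-insensitive and anchored on word boundaries, so that "mice" does not fire inside "micelle".

```python
import re

# Keywords per organism, copied from the suggestions in this thread.
ORGANISM_KEYWORDS = {
    "drosophila": ["Drosophila", "Drosophila melanogaster", "D. melanogaster", "fruit fly"],
    "C. elegans": ["C. elegans", "Caenorhabditis elegans", "Caenorhabditis"],
    "arabidopsis": ["Arabidopsis", "A. thaliana", "Arabidopsis thaliana"],
    "mouse": ["Mus musculus", "musculus", "murine", "mouse", "mice"],
    "zebrafish": ["zebrafish", "Danio rerio", "D. rerio"],
}


def compile_patterns(keywords):
    """Build one case-insensitive, word-bounded regex per organism."""
    patterns = {}
    for organism, terms in keywords.items():
        # Longest terms first so multi-word names are tried before their parts.
        ordered = sorted(terms, key=len, reverse=True)
        alternation = "|".join(re.escape(term) for term in ordered)
        patterns[organism] = re.compile(
            r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE
        )
    return patterns


PATTERNS = compile_patterns(ORGANISM_KEYWORDS)


def classify(fields):
    """Return the sorted organisms whose keywords occur in any text field.

    `fields` is a hypothetical dict such as
    {"title": ..., "subject": ..., "journal": ...}.
    Each field is searched separately so a match cannot span two fields.
    """
    hits = set()
    for text in fields.values():
        for organism, pattern in PATTERNS.items():
            if pattern.search(text):
                hits.add(organism)
    return sorted(hits)
```

A paper can land in more than one literature this way, and the false-positive caveat (organism mentioned but not studied) still applies.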

@goldturtle
Contributor Author

Hi Michael,

I wasn't quite sure what you meant by 'subject' in your email, but here are a few other thoughts for classifying literature by organism.

Paper titles and abstracts are probably reasonably good sources of organism names, with all the usual caveats about false positives (e.g. organism is mentioned but the paper does not contain experiments about it) and false negatives (e.g. authors mention mouse but also do an experiment in a human cell line that they don't mention).

For papers that have already been indexed by PubMed, the MESHHeadingList and ChemicalList tags in the XML could also be used.

The list of organisms on the GO annotation downloads page may be helpful, since it indicates organisms for which there was at least sufficient interest to generate GO annotations:

http://geneontology.org/page/download-annotations

Other microbial species that might be of interest are listed in Table 1 of this WormBook chapter:

http://www.wormbook.org/chapters/www_intermicrobpath/intermicrobpath.html
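
For the MeSH suggestion above, here is a rough sketch of pulling descriptor and qualifier names out of a PubMed XML record with Python's standard library. In the PubMed XML the elements are spelled `MeshHeadingList`, `MeshHeading`, `DescriptorName` and `QualifierName`. The sample record is hand-written for illustration (real records carry more elements and attributes), and the function name is an assumption.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily trimmed stand-in for a PubMed record.
SAMPLE_PUBMED_XML = """
<PubmedArticle>
  <MedlineCitation>
    <MeshHeadingList>
      <MeshHeading>
        <DescriptorName>Animals</DescriptorName>
      </MeshHeading>
      <MeshHeading>
        <DescriptorName>Zebrafish</DescriptorName>
        <QualifierName>genetics</QualifierName>
      </MeshHeading>
    </MeshHeadingList>
  </MedlineCitation>
</PubmedArticle>
"""


def mesh_headings(xml_text):
    """Return a list of (descriptor, [qualifiers]) tuples from a PubMed record."""
    root = ET.fromstring(xml_text)
    headings = []
    for heading in root.iter("MeshHeading"):
        descriptor = heading.findtext("DescriptorName")
        qualifiers = [q.text for q in heading.findall("QualifierName")]
        headings.append((descriptor, qualifiers))
    return headings
```

The descriptor names could then be matched against the same organism keyword lists used for titles.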

@goldturtle
Contributor Author

More from Chris and Michael:

Michael:
In addition to the official names, are there also common words that are used to identify species? For example, if there is only the word yeast in the title, would it be safe to assign it to S. cerevisiae (as opposed to fission yeast)?

Chris:
For S. cerevisiae the only common words might be "baker's yeast" or "budding yeast". "Yeast" alone would not be sufficient as there are so many types of yeast. For the others that I listed I can only think of (only know of) the names given to the left of the species name, e.g. "slime mold" (of which there are also many types so you would get false positives if looking for Dictyostelium discoideum). For Rattus norvegicus, phrases could be "Norway rat", "brown Norway rat", or simply "brown rat". Wikipedia has some common names for each but I don't know if they would come up in the biomedical literature.
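
One way to encode the distinction drawn above is to keep specific common names separate from ambiguous ones, so that "yeast" or "slime mold" on its own is flagged rather than assigned to a species. This is only a sketch; the table contents come from this thread, while the function name and the two-part return value are assumptions.

```python
import re

# Common names that point to one species, per the discussion in this thread.
SPECIFIC_TERMS = {
    "baker's yeast": "Saccharomyces cerevisiae",
    "budding yeast": "Saccharomyces cerevisiae",
    "fission yeast": "Schizosaccharomyces pombe",
    "Norway rat": "Rattus norvegicus",
    "brown rat": "Rattus norvegicus",
}

# Names that cover many species and should not be assigned automatically.
AMBIGUOUS_TERMS = {"yeast", "slime mold"}


def _contains(term, text):
    """True if `term` occurs in `text` as a whole word or phrase."""
    return re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", text) is not None


def resolve_common_names(text):
    """Return (species, unresolved): confident assignments and ambiguous hits."""
    remaining = text.lower()
    species = set()
    for term, name in SPECIFIC_TERMS.items():
        term = term.lower()
        if _contains(term, remaining):
            species.add(name)
            # Blank out the specific phrase so its "yeast" is not re-counted.
            remaining = remaining.replace(term, " ")
    unresolved = {term for term in AMBIGUOUS_TERMS if _contains(term, remaining)}
    return species, unresolved
```

Titles that only produce `unresolved` hits could be set aside for another signal, such as the full species name appearing elsewhere in the record.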

@goldturtle
Contributor Author

We could also solicit input from the various MODs here, as most groups probably have PubMed keyword searches that get them a reasonably good list of papers, or at least could help us avoid any obvious pitfalls.

@goldturtle
Contributor Author

In the nxml files there is a 'subject' line, but I don't know what
guidelines it follows. I'll follow up on your links.

M.

On 09/27/2016 10:59 AM, vanaukenk wrote:

Hi Michael,

I wasn't quite sure what you meant by 'subject' in your email, but
here are a few other thoughts for classifying literature by organism.

Paper titles and abstracts are probably reasonably good sources of
organism names, with all the usual caveats about false positives (e.g.
organism is mentioned but the paper does not contain experiments about
it) and false negatives (e.g. authors mention mouse but also do an
experiment in a human cell line that they don't mention).

For papers that have already been indexed by PubMed, the
MESHHeadingList and ChemicalList tags in the XML could also be used.

The list of organisms on the GO annotation downloads page may be
helpful, since it indicates organisms for which there was at least
sufficient interest to generate GO annotations:

http://geneontology.org/page/download-annotations

Other microbial species that might be of interest are listed in Table
1 of this WormBook chapter:

http://www.wormbook.org/chapters/www_intermicrobpath/intermicrobpath.html



@goldturtle
Contributor Author

Okay, thanks. I was looking at the XML display for papers in PubMed and searching the page for 'Subject' but couldn't find anything there. For example:
https://www.ncbi.nlm.nih.gov/pubmed/27665728?report=xml&format=text
Do you have a URL or other location for an nxml that I could look at?

@goldturtle
Contributor Author

I attached an example. Open it in a text editor and search for the
subject tag (near the beginning of the file). The keywords are
embedded in the tags.

M.
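
For reference, a rough sketch of reading the subject tags, article title and journal name from an nxml file with Python's standard library. The sample below is hand-written in the JATS layout that PMC nxml files follow (`article-categories` / `subj-group` / `subject`); it is not the attached file, and the function names are assumptions.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily trimmed stand-in for the front matter of an nxml file.
SAMPLE_NXML = """
<article>
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Example Journal</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-categories>
        <subj-group subj-group-type="Discipline">
          <subject>Ecology</subject>
          <subject>Evolutionary Biology</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>An example title</article-title>
      </title-group>
    </article-meta>
  </front>
</article>
"""


def text_of(element):
    """Concatenate all text inside an element, including inline markup."""
    if element is None:
        return ""
    return "".join(element.itertext()).strip()


def nxml_fields(xml_text):
    """Return the journal name, article title and subject list of an nxml record."""
    root = ET.fromstring(xml_text)
    return {
        "journal": text_of(root.find(".//journal-title")),
        "title": text_of(root.find(".//article-title")),
        "subjects": [text_of(subject) for subject in root.iter("subject")],
    }
```

These are the three fields that the keyword scan described at the top of this thread works on.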

On 09/27/2016 11:17 AM, vanaukenk wrote:

Okay, thanks. I was looking at the XML display for papers in PubMed
and searching the page for 'Subject' but couldn't find anything there.
For example:
https://www.ncbi.nlm.nih.gov/pubmed/27665728?report=xml&format=text
Do you have a URL or other location for an nxml that I could look at?



@goldturtle
Contributor Author

Interesting example.

Comparing the nxml subjects:
Ecology
Evolutionary Biology
Ecology/Behavioral Ecology
Ecology/Evolutionary Ecology
Ecology/Population Ecology
Evolutionary Biology/Animal Behavior
Evolutionary Biology/Evolutionary Ecology

with the MESHHeadings:
Animals
Grasshoppers/genetics
Phenotype
Phylogeny
Pigmentation/genetics
Polymorphism, Genetic
Population Dynamics
Survival Analysis

there isn't much, if any, overlap, and the actual organism, grasshoppers, is only represented in the MESHHeadingList.

I don't know how representative this example is, but it suggests that maybe a union of the nxml subjects with the MESH terms might be the best source for mining subjects for literature classification.
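
The comparison above can be written out as a small set computation. The term lists are copied from this comment (MeSH descriptors only, without the "genetics" qualifiers); splitting the subject paths on "/" and lowercasing are assumptions about how the two vocabularies might be normalized before comparison.

```python
NXML_SUBJECTS = [
    "Ecology",
    "Evolutionary Biology",
    "Ecology/Behavioral Ecology",
    "Ecology/Evolutionary Ecology",
    "Ecology/Population Ecology",
    "Evolutionary Biology/Animal Behavior",
    "Evolutionary Biology/Evolutionary Ecology",
]

MESH_HEADINGS = [
    "Animals",
    "Grasshoppers",
    "Phenotype",
    "Phylogeny",
    "Pigmentation",
    "Polymorphism, Genetic",
    "Population Dynamics",
    "Survival Analysis",
]


def normalize(terms, split_paths=False):
    """Lowercase the terms; optionally split 'A/B' subject paths into parts."""
    normalized = set()
    for term in terms:
        pieces = term.split("/") if split_paths else [term]
        normalized.update(piece.strip().lower() for piece in pieces)
    return normalized


subject_terms = normalize(NXML_SUBJECTS, split_paths=True)
mesh_terms = normalize(MESH_HEADINGS)

overlap = subject_terms & mesh_terms   # terms found in both sources
combined = subject_terms | mesh_terms  # union proposed for classification
```

For this one paper `overlap` is empty, and "grasshoppers" reaches `combined` only through the MeSH side.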

@goldturtle
Contributor Author

Well, in this case the animal is mentioned in the title. I don't
understand why PMC doesn't have the MeSH terms in their nxmls. The issue
is that I would like to do any classification with the nxml file only
and not introduce a third source that needs to be synced with the paper.

M.

On 09/27/2016 11:57 AM, vanaukenk wrote:

Interesting example.

Comparing the nxml subjects:
Ecology
Evolutionary Biology
Ecology/Behavioral Ecology
Ecology/Evolutionary Ecology
Ecology/Population Ecology
Evolutionary Biology/Animal Behavior
Evolutionary Biology/Evolutionary Ecology

with the MESHHeadings:
Animals
Grasshoppers/genetics
Phenotype
Phylogeny
Pigmentation/genetics
Polymorphism, Genetic
Population Dynamics
Survival Analysis

there isn't much, if any, overlap, and the actual organism,
grasshoppers, is only represented in the MESHHeadingList.

I don't know how representative this example is, but it suggests that
maybe a union of the nxml subjects with the MESH terms might be the
best source for mining subjects for literature classification.



@goldturtle
Contributor Author

Yes, I see the point about not wanting to have to go to a third source.
I don't know why PMC doesn't also include MeSH terms in their nxmls.
Perhaps an alternative would be to make use of MeSH headings to find other terms or phrases to include for organismal TPC literature classification, although I suspect you have a good list to start from.
For all of the genus species names, though, I think you will want to also look at abbreviations like:
S. cerevisiae
X. laevis
etc.
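
The abbreviated forms could be generated from the full names rather than typed by hand. A small sketch, assuming every input is a plain two-word "Genus species" name; the function names are hypothetical.

```python
def abbreviate_binomial(name):
    """Turn 'Genus species' into 'G. species'."""
    parts = name.split()
    if len(parts) != 2:
        raise ValueError(f"expected a two-word binomial name, got {name!r}")
    genus, species = parts
    return f"{genus[0].upper()}. {species}"


def name_variants(name):
    """Return the full binomial name followed by its abbreviated form."""
    return [name, abbreviate_binomial(name)]
```

For example, `name_variants("Xenopus laevis")` gives the full name plus "X. laevis", and both can be added to the keyword list for that organism.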
