A full-day workshop to introduce TIGR2ESS delegates to the principles and practice of ContentMining (also known as Text and Data Mining, TDM) for plant science research.
Note: all materials in dictionaries and code use only characters 32-127 (printable ASCII). Much software, e.g. browsers, cannot render higher code points reliably. We appreciate that this does not do justice to names, places and words from non-anglophone cultures.
Throughout this workshop we shall use Ocimum sanctum (Holy Basil) as the primary example. Tutorial material has been generated for most steps using O. sanctum.
The material should be located under $HOME/Desktop/, e.g. $HOME/Desktop/ContentMine/.
For most work the "starting directory" is ContentMine/tigr2ess (aka "TIGR2ESS_HOME"). You may also need ContentMine/software and ContentMine/dictionaries.
The scientific/medical literature generates thousands of articles per day, and the rate continues to increase. This workshop uses a subset, the Open Access subset in Europe PubMed Central (EPMC), which though not comprehensive is up-to-date. It is well suited to rapid overviews of subjects, concepts and entities. TDM of the whole literature is messy and difficult, but the NIH/NLM requires Open articles to be converted to JATS/XML, which is easy to download, search and analyse. Although the NIH concentrates on medical literature, it indexes enough plant science to yield very valuable new insights and summaries.
All the techniques in this workshop are valuable across scientific fields. The major limitation is the availability of open, semantic fulltext, but the growth of preprints and Open Access should improve this rapidly.
Delegates will learn how to:
- search EPMC for plant science articles
- download and organize them in ContentMine projects (CProjects)
- understand the power of Wikipedia and Wikidata for plant sciences and run searches.
- create small dictionaries of relevance to TIGR2ESS.
- use dictionaries to search downloaded data for terms (entities) of special interest (typically genes, plant species, plant parts, phytochemicals, human diseases, countries, funding agencies).
- analyze the frequency of entities (showing what the "literature" is most interested in).
- plot the co-occurrence of terms (e.g. plants associated with disease).
All work is on delegates' laptops. They will need to:
- install the ContentMine software ("getpapers" and "AMI")
- verify it works on tests
- run the specific exercises in the tutorial
- be able to use a JSON editor for editing dictionaries
Ideally the technology for this should be verified before the full-day workshop.
This lists the formal exercises that we will carry out in the first half of the workshop. All exercises will be verified beforehand to ensure that they:
- work on major operating systems and hardware.
- can be adapted in case of variable download speeds or unexpected drop-outs.
The following order may change, and some exercises may be combined.
Check that getpapers and AMI have been installed by everyone. Every delegate will have a memory stick containing:
- these instructions
- software (getpapers and AMI)
- typical searches and results (Ocimum and 3 crops: millet, rice, wheat)
- current dictionaries
(Each of these will have a separate full page)
Use the EuropePMC.org interface to explore how queries work. Understand false positives. Develop complex queries (AND, OR) and discuss the likely pitfalls.
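As a sketch of how such boolean queries can be assembled programmatically, the snippet below builds a search URL for the public Europe PMC REST endpoint (the endpoint and the OPEN_ACCESS:Y field are part of the Europe PMC service; the helper function itself is ours, for illustration only):

```python
from urllib.parse import urlencode

# Public Europe PMC REST search endpoint
EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def build_query_url(*terms, operator="AND", open_access=True):
    """Combine search terms with a boolean operator into a search URL."""
    query = f" {operator} ".join(terms)
    if open_access:
        # restrict to the Open Access subset used in this workshop
        query += " AND OPEN_ACCESS:Y"
    return EPMC_SEARCH + "?" + urlencode({"query": query, "format": "json"})

url = build_query_url("ocimum", "phytochemical")
print(url)
```

Pasting the resulting query string into the EuropePMC.org search box should give the same hits as the web interface, which makes it easy to check for false positives before committing to a large download.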
Note. Bandwidth may need to be managed. Download a subset (10 articles) of the Ocimum papers in EPMC using getpapers, initially as XML. Extend the query to "ocimum AND country" (this will limit bandwidth). Delegates will then retrieve the PDFs. Feedback/discussion of semantics and XML, the differences between XML and PDF, and the value of each. (Delegates can then download more papers in their own time.)
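To illustrate why JATS/XML is so much easier to mine than PDF, here is a minimal, hand-made JATS-like fragment (not a real article) parsed with Python's standard library; the semantic tags let us pull out the title and abstract directly, which PDF cannot offer:

```python
import xml.etree.ElementTree as ET

# A minimal JATS-like fragment, invented for illustration
jats = """
<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>Antioxidant activity of Ocimum sanctum</article-title>
      </title-group>
      <abstract><p>Holy basil extracts show antioxidant activity.</p></abstract>
    </article-meta>
  </front>
</article>
"""

root = ET.fromstring(jats)
# Semantic markup means we can address parts of the paper by name
title = root.findtext(".//article-title")
abstract = root.findtext(".//abstract/p")
print(title)
print(abstract)
```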
Open resources (MeSH/Medline, Taxdump, etc.). Many of these are being subsumed into Wikidata.
What WP is, how it is updated and checked. Overview of plant taxonomy in WP. Special pages (Category: foo, List_of_foo).
Exploration and discussion of Items (Q) and Properties (P) relating to O. sanctum.
https://www.wikidata.org/wiki/Wikidata:WikiFactMine . A large set of dictionaries developed by ContentMine, generally from SPARQL queries. Creation of a WikiFactMine dictionary for "crops".
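The sketch below shows the general shape such a "crops" dictionary might take as editable JSON; the exact AMI/WikiFactMine schema may differ, and the Wikidata IDs here are deliberate placeholders, not real identifiers:

```python
import json

# Sketch of a small "crops" dictionary: terms mapped to Wikidata items.
# A SPARQL query of this general shape could generate the entries
# (the Q-number for "crop" is deliberately left unresolved):
#   SELECT ?item ?itemLabel WHERE { ?item wdt:P279* wd:Q<crop> . ... }
entries = [
    {"term": "pearl millet", "wikidataID": "Q000001"},  # placeholder ID
    {"term": "rice",         "wikidataID": "Q000002"},  # placeholder ID
    {"term": "wheat",        "wikidataID": "Q000003"},  # placeholder ID
]

dictionary = {"id": "crops", "entries": entries}

# Write the dictionary so it can be edited in any JSON editor
with open("crops.json", "w") as f:
    json.dump(dictionary, f, indent=2)
```

Because the file is plain JSON, delegates can add entries with any JSON editor and validate the file simply by reloading it with `json.load`.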
AMI will be used to transform the downloaded XML into HTML (which will be used for the searching).
AMI will be used to compute the raw frequencies of words in each article. The result is displayed in DataTables. Discussion of the distribution of terms found in the articles.
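The idea of raw word frequencies can be sketched in a few lines (a toy stand-in for one article's text; AMI does this at scale over a whole CProject):

```python
import re
from collections import Counter

# Toy stand-in for the text of one downloaded article
text = ("Ocimum sanctum, or holy basil, is widely studied. "
        "Extracts of Ocimum sanctum show antioxidant activity.")

# Lowercase and split into alphabetic tokens, then count
words = re.findall(r"[a-zA-Z]+", text.lower())
freq = Counter(words)
print(freq.most_common(3))
```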
AMI will be used to search the local HTML papers using dictionaries. Initially these will be bundled with the system. Results are added to DataTables. The terms will then link back to Wikidata and Wikipedia for interpretation.
AMI can display the co-occurrence of terms from different dictionaries.
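Conceptually, co-occurrence is counted per article: when a term from one dictionary and a term from another appear in the same paper, that pair's count increases. A toy sketch with invented per-article hits (the article IDs and hits are made up for illustration):

```python
from collections import Counter
from itertools import product

# Invented per-article term hits from two dictionaries
plant_hits = {
    "PMC001": {"ocimum"},
    "PMC002": {"ocimum", "millet"},
    "PMC003": {"millet"},
}
disease_hits = {
    "PMC001": {"diabetes"},
    "PMC002": {"malaria"},
    "PMC003": {"diabetes"},
}

# Count every (plant, disease) pair that shares an article
cooc = Counter()
for pmcid in plant_hits:
    for pair in product(sorted(plant_hits[pmcid]), sorted(disease_hits[pmcid])):
        cooc[pair] += 1
print(cooc)
```

A matrix like this is exactly what a plot of "plants associated with disease" summarises.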
At this stage the value of dictionaries should become clear. Delegates will be invited to edit JSON dictionaries, initially by adding extra entries. Ideally all entries should reference Wikidata. Later, delegates can create complete (small) dictionaries either by adding items by hand, or by using AMI to convert Wikipedia pages into dictionaries.
In the latter part of the day we invite delegates to form small groups and tackle meaningful literature-based projects. These could be centred on other plant (or animal/microbial) species, geographical regions, genes, and even political entities. This may involve:
- creating a new dictionary for the topic
- downloading more articles
There is also the opportunity to use other downstream analysis programs, e.g. R. The AMI output is text files on the file system - no databases are required. Instead there is a schema of file hierarchy and semantics. Therefore any delegates who are fluent in R, Python, shell, etc. can use these files directly.
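As a sketch, any script can walk such a project tree directly. The layout below (one directory per article containing a fulltext.xml) is illustrative and built on the fly here; check your own CProject for the exact hierarchy:

```python
import os
import tempfile

# Build an illustrative CProject-style tree: one directory per article
root = tempfile.mkdtemp()
for pmcid in ("PMC001", "PMC002"):
    d = os.path.join(root, pmcid)
    os.makedirs(d)
    with open(os.path.join(d, "fulltext.xml"), "w") as f:
        f.write("<article/>")

# Collect every fulltext.xml by walking the file system - no database needed
papers = sorted(
    os.path.join(dirpath, name)
    for dirpath, _, names in os.walk(root)
    for name in names if name == "fulltext.xml"
)
print(len(papers))
```

The same traversal is a one-liner in R (`list.files(recursive = TRUE)`) or shell (`find`), which is why fluency in any scripting language is enough for downstream analysis.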