
docanalysis README suggested edits to early version of "Running docanalysis" section

Emanuel Faria edited this page Jul 27, 2022 · 1 revision

I've rewritten some of the README with the intent of writing for a non-academic / non-programmer audience. My comments are added throughout, usually highlighted by emoji.

Example commands

Example

INPUT

docanalysis --project_name terpene_10 --make_section --spacy_model spacy --entities ORG --output org.csv

LOGS

INFO: Found 7134 sentences in the section(s).
INFO: Loading spacy
100% 7134/7134 [01:08<00:00, 104.16it/s]
/usr/local/lib/python3.7/dist-packages/docanalysis/entity_extraction.py:352: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
  "[", "").str.replace("]", "")
INFO: wrote output to /content/terpene_10/org.csv

Extract information from specific section(s)

You can choose to extract entities from specific sections

Example

COMMAND

docanalysis --project_name terpene_10 --make_section --spacy_model spacy --search_section AUT, AFF --entities ORG --output org_aut_aff.csv

LOG

INFO: Found 28 sentences in the section(s).
INFO: Loading spacy
100% 28/28 [00:00<00:00, 106.66it/s]
/usr/local/lib/python3.7/dist-packages/docanalysis/entity_extraction.py:352: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
  "[", "").str.replace("]", "")
INFO: wrote output to /content/terpene_10/org_aut_aff.csv

Create dictionary of extracted entities

COMMAND

docanalysis --project_name terpene_10 --make_section --spacy_model spacy --search_section AUT, AFF --entities ORG --output org_aut_aff.csvv --make_ami_dict org

LOG

INFO: Found 28 sentences in the section(s).
INFO: Loading spacy
100% 28/28 [00:00<00:00, 96.56it/s] 
/usr/local/lib/python3.7/dist-packages/docanalysis/entity_extraction.py:352: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
  "[", "").str.replace("]", "")
INFO: wrote output to /content/terpene_10/org_aut_aff.csvv
INFO: Wrote all the entities extracted to ami dict

Snippet of the dictionary

<?xml version="1.0"?>
<dictionary title="/content/terpene_10/org.xml">
<entry count="2" term="Department of Biochemistry"/>
<entry count="2" term="Chinese Academy of Agricultural Sciences"/>
<entry count="2" term="Tianjin University"/>
<entry count="2" term="Desert Research Center"/>
<entry count="2" term="Chinese Academy of Sciences"/>
<entry count="2" term="University of Colorado Boulder"/>
<entry count="2" term="Department of Neurology"/>
<entry count="1" term="Max Planck Institute for Chemical Ecology"/>
<entry count="1" term="College of Forest Resources and Environmental Science"/>
<entry count="1" term="Michigan Technological University"/>

https://github.com/petermr/docanalysis/blob/main/README.md#what-is-a-dictionary
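
To inspect a dictionary like the snippet above outside of docanalysis, Python's standard `xml.etree.ElementTree` module can read it. The following is a minimal sketch and not part of docanalysis: the helper name `read_ami_entries` is hypothetical, and the sample adds a closing `</dictionary>` tag as an assumption, because the snippet above is truncated.

```python
import xml.etree.ElementTree as ET

# Sample shaped like the snippet above. The closing tag is an assumption
# added here so that the text is well-formed XML.
SAMPLE = """<?xml version="1.0"?>
<dictionary title="/content/terpene_10/org.xml">
<entry count="2" term="Tianjin University"/>
<entry count="1" term="Michigan Technological University"/>
</dictionary>
"""


def read_ami_entries(xml_text):
    """Return (term, count) pairs from an ami-dict style XML string."""
    root = ET.fromstring(xml_text)
    return [(entry.get("term"), int(entry.get("count", "0")))
            for entry in root.iter("entry")]


entries = read_ami_entries(SAMPLE)
for term, count in entries:
    print(f"{count}\t{term}")
```

For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring` in the same way.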

All at one go!

docanalysis --run_pygetpapers -q "terpene" -k 10 --project_name terpene_10 --make_section --output entities_20220209.csv --make_ami_dict entities_20220209.xml

credits:

developers

special thanks

  • if any

technologies

pygetpapers — searches for and downloads papers from europepmc.org (“EUPMC”) (.html, .xml, .pdf, and/or .json)

docanalysis - ingests CProjects and carries out text-analysis of documents, including sectioning, NLP/text-mining, vocabulary generation. Uses NLTK and other Python tools for many operations, and spaCy or scispaCy for extraction and annotation of entities. Outputs summary data and word-dictionaries.

docanalysis integrates and leverages the power of the following open-source technologies:

  • py4ami

    • spaCy

      • Here's the list of NER labels SpaCy's English model provides:
        CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART

    • sciSpaCy

    • NLTK

  • pygetpapers - scrape open repositories to download papers of interest

    • EUPMC

  • pyamiimage

    • EasyOCR

    • Tesseract

    • NLTK splits sentences

running `docanalysis`
=====================

🔔🔔

It would be great to have a startup command that would work something like this...

 

docanalysis --start

If you would like to first create a new virtual environment (venv) [do this…]

If you would like to activate an existing venv [type this] ….

🔔🔔

 

help menu

Once docanalysis is installed, typing `docanalysis --help` (followed by enter/return) into your terminal will display the help menu (see below).

 

the usage message:

[As is customary](https://en.wikipedia.org/wiki/Usage_message), near the top of the help menu is the menu section title “usage:”.

On the left, the word “docanalysis” is displayed. This is the command that actually launches the program.

On the right is a list of all the argument options, or “flags” (displayed here in square brackets “[]” to indicate the syntax by which they may be used). Flags operate as sub-commands with which you operate the program and customize its use to suit your particular purposes.

[(Note that square brackets “[]” are used here in the usage message solely to facilitate ease of reading. To actually use the argument options you will use either a single or double dash as shown in the “optional arguments:” section of the help menu)](https://en.wikipedia.org/wiki/Command_line_argument).

 

the argument options/flags (and explanations)

In this section of the help menu, a list of argument options (also known as “flags”) is displayed along with descriptions of their purpose and/or use. Flags can be specified with either a single dash (-) or a double dash (--); some flags offer both forms. When building docanalysis commands, use one form or the other for a given flag, but not both.

Rather than listing them alphabetically, our help menu displays them in the order in which they would most likely be used in a command, with options that are similar in function grouped together. For example, besides defining the directory on your computer where you would like an export to be saved, you must also define the filetype(s) you wish to export (HTML, JSON, or CSV), and it makes sense to write those together in your command.
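
The usage text has the shape produced by Python's `argparse` module, so a small `argparse` sketch can illustrate how the single-dash and double-dash forms relate. This is a hypothetical stand-in parser (the program name `docanalysis-sketch` and the chosen flags are assumptions for illustration), not the parser docanalysis itself uses. In `argparse`, a flag has a short form only if the developer declares one, which would explain why some flags exist in the long form alone; it is not an operating-system difference.

```python
import argparse

# Hypothetical parser mimicking a few docanalysis flags for illustration.
parser = argparse.ArgumentParser(prog="docanalysis-sketch")
parser.add_argument("-q", "--query", help="search terms")
parser.add_argument("-k", "--hits", type=int, help="maximum number of papers")
# Declared with a long form only, so no single-dash version exists.
parser.add_argument("--html", help="path for HTML output")

short_form = parser.parse_args(["-q", "terpene", "-k", "10"])
long_form = parser.parse_args(["--query", "terpene", "--hits", "10"])

# Both spellings fill in the same settings.
print(short_form == long_form)
print(short_form.query, short_form.hits)
```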

 

!!⛔️⛔️ Help Menu Suggestions:
🔔Top of help should begin with “Welcome to docanalysis version x.x.x. To check for and install updates, type docanalysis --update”
🔔Use lines of dashes to visually separate different parts/categories of information in the help dialog
🔔standardize single and double dash use. Why do some (eg --html HTML) not have the single dash version? Is this a PC/MacOS thing?
🔔Remember to activate (launch) the required venv every time you run docanalysis and deactivate (quit) it thereafter.
⛔️⛔️!!

Welcome to docanalysis version 0.1.1 

🔔New versions: https://pypi.org/project/docanalysis/

🔔To upgrade on Windows:  pip install --force-reinstall --no-cache-dir docanalysis
🔔To upgrade on Mac:      pip3 install --force-reinstall --no-cache-dir docanalysis

🔔For detailed setup, usage and background information, see the docanalysis README: https://github.com/petermr/docanalysis/blob/main/README.md
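
Before upgrading, it can help to check which version is currently installed. The sketch below uses Python's standard `importlib.metadata` (Python 3.8 or later) and assumes only that the package is published under the name `docanalysis`, as on the PyPI page above. The helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string, or None when docanalysis is not installed
# in the active environment (for example, when the venv is not activated).
print(installed_version("docanalysis"))
```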

---------

🔔docanalysis           initializes the program, precedes the launch of all 
                        other sub-programs, and customizes their operation via the 
                        argument options (also called “flags”) displayed in square brackets below.❓

usage: "docanalysis [options]"

-----------------------------------------------------------------------------------
docanalysis [options]  [-h] 🔔[-V] [--run_pygetpapers] [--make_section] [-q QUERY]
                       [-k HITS] [--project_name PROJECT_NAME] [-d DICTIONARY]
                       [-o OUTPUT] [--make_ami_dict MAKE_AMI_DICT]
⛔explain the use of sub-brackets such as these below this paragraph⛔
                       [--search_section [SEARCH_SECTION [SEARCH_SECTION ...]]]
                       [--entities [ENTITIES [ENTITIES ...]]]
                       [--spacy_model SPACY_MODEL] [--html HTML]
                       [--synonyms SYNONYMS] [--make_json MAKE_JSON] [-l LOGLEVEL]
                       [-f LOGFILE]
-----------------------------------------------------------------------------------
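
On the nested brackets in the usage message: in `argparse` usage text, a form like `[--entities [ENTITIES [ENTITIES ...]]]` indicates a flag that accepts zero or more values. The sketch below reproduces that behaviour with `nargs="*"` in a hypothetical stand-in parser; the defaults of `ALL` follow the help text, but whether docanalysis declares its flags exactly this way is an assumption.

```python
import argparse

# Hypothetical stand-in for the two multi-value flags in the usage message.
parser = argparse.ArgumentParser(prog="docanalysis-sketch")
parser.add_argument("--search_section", nargs="*", default=["ALL"])
parser.add_argument("--entities", nargs="*", default=["ALL"])

# Several values are separated by spaces.
args = parser.parse_args(["--search_section", "AUT", "AFF", "--entities", "ORG"])
print(args.search_section)
print(args.entities)

# Leaving a flag out falls back to its default.
defaults = parser.parse_args([])
print(defaults.search_section)
```

In this sketch, a comma typed after a value would become part of that value, so the values are separated by spaces only.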



options:
------------------

-h, --help              ⛔️display this help menu and usage information❓Dialog??❓ ❓and exit❓⛔️

-V, --version           display the currently installed version number of docanalysis 
                        ⛔️and its sub-programs??⛔️




========= GETPAPERS ARGUMENT OPTIONS ========= 
⛔️⛔️ Is docanalysis the “program” and the other tools, such as “pygetpapers” sub-programs? If so, distinguishing this will make the part about building command-line queries easier to explain and understand.⛔️⛔️

--run_pygetpapers       launches pygetpapers, the sub-program within docanalysis
                        that downloads papers from europepmc.org, subject to the 
                        user’s QUERY parameters

-q <query>, --query <query>
                        replace <query> with the boolean search parameters that 
                        pygetpapers will use to download the desired articles from
                        europepmc.org. NOTE:⛔️ specified queries must begin and 
                        end with quotation marks ("").⛔️
                        Example: docanalysis --run_pygetpapers -q "terpene"



========== GETPAPERS EUPMC DOWNLOAD OPTIONS ========= 
-k <hits>, --hits <hits>    replace <hits> with the numerical value specifying the
                        maximum number of papers you wish to find and download
                        Example: docanalysis --run_pygetpapers -q "terpene" -k 10


⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️

What happened to the other options that were available in the original version of getpapers?

-n, --noexecute         reports how many results match the query, without actually downloading anything.

There are over 39 million articles, preprints and more in EUPMC; we don't want to download all by mistake, so it's worth running a query with -n to test, and perhaps -k 200 to download the first trial set. You can download thousands, but the connection may break and it's worth being able to develop the analysis anyway.

-a, --all                 search all papers, not just open access

--api <name>            API to search [eupmc, crossref, ieee, arxiv] (default: eupmc)
-x, --xml               download fulltext XMLs if available
-p, --pdf               download fulltext PDFs if available
-s, --supp              download supplementary files if available
-t, --minedterms        download text-mined terms if available
--filter <filter object>  filter by key value pair, passed straight to the crossref api only
-r, --restart             restart file downloads after failure

we need --INPUTTEXTLOC and --OUTPUTTEXTDIR

⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️⛔️

========= ANNOTATION OPTIONS =========   

-d <dictionary>, --dictionary <dictionary>
                        Replace <dictionary> with the name⛔️path??⛔️ of an ⛔️ami 
                        dictionary by which to annotate sentences or support
                        supervised entity extraction.
🔔How do I point at dictionaries? Can I point at a directory full of them and have them all discovered automatically?🔔

--spacy_model SPACY_MODEL
                        optional. Choose between spacy or scispacy models.
                        Defaults to spacy

--search_section [SEARCH_SECTION [SEARCH_SECTION ...]]
                        provide section(s) to annotate. Choose from: ✍️ALL, ACK,
                        AFF, AUT, CON, DIS, ETH, FIG, INT, KEY, MET, RES, TAB,
                        TIL. Defaults to ALL✍️

--entities [ENTITIES [ENTITIES ...]]
                        provide entities to extract. Default(ALL), or choose from
                        SpaCy: ✍️CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW,
                        LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON,
                        PRODUCT, QUANTITY, TIME, WORK_OF_ART; SciSpaCy:
                        CHEMICAL, DISEASE✍️
                        ⛔️⛔️What about SciSpacy? I think we should include 
                        SciSpacy in the installation and provide instructions for 
                        using SpaCy, SciSpacy or both simultaneously. This would 
                        also show us whether the SciSpacy installation is compatible
                        with, or breaks, the docanalysis installation⛔️⛔️

--synonyms SYNONYMS     searches the corpus/sections with synonyms from ami-dict



========= MAKE/EXPORT OPTIONS =========  
-o OUTPUT, --output OUTPUT
                        outputs csv file ⚠️csv only, or is there a list of options?
                        ⚠️ ⁉️wouldn't tsv be "safer" for chemical 
                        names, etc.?⁉️

⛔️--html HTML           saves output in html format ⁉️to given path⁉️ (can user 
                        choose path?)

--make_json MAKE_JSON   output in json format ⁉️To what end?⁉️

--make_section          makes sections ⁉️ALL? or can these be specified?⁉️

--make_ami_dict MAKE_AMI_DICT
                        provide title for ami-dict. Makes ami-dict of all
                        extracted entities


========= EXPORT FOLDER/PATH OPTIONS =========  
--project_name <project_name> ⛔️replaced capitalization with lower case in “<>"⛔️

⁉️Suggest that we combine project_name with an output directory option, -o <path>, --⛔️outdir⛔️ <path> (as was used in the original version of getpapers), to avoid confusion about naming a folder and deciding where it goes⁉️
                        Replace <project_name> with your choice of name for the 
                        folder/directory that will be created 
⁉️in your venv? is file path chosen here?⁉️
                        to store/contain the papers you download for further 
                        docanalysis processing.
                        ⁉️(I think --project_folder would be more
                        "for Dummies" user-friendly)⁉️


========= LOG DISPLAY AND EXPORT ========= 
⛔️⛔️-l, --loglevel <level>    amount of information to log (silent, verbose, info*, data, warn, error, or debug)⛔️⛔️

-l LOGLEVEL, --loglevel LOGLEVEL
                        provide logging level. Example: --loglevel warning
                        ⛔️choose one? let's add descriptions for each level⛔️<<info,warning,debug,error,critical>>, default='info'

-f LOGFILE, --logfile LOGFILE
                        saves log to specified file in output directory as
                        well as printing to terminal
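
As a starting point for the level descriptions requested above, the five names in the help text match the standard levels of Python's `logging` module. The sketch below lists them with short descriptions paraphrased from the Python documentation; that docanalysis maps its `--loglevel` values directly onto these levels is an assumption, and the helper `is_shown` is hypothetical.

```python
import logging

# Descriptions paraphrase the Python logging documentation. Whether
# docanalysis maps --loglevel values onto these levels is an assumption.
LEVELS = {
    "debug": "detailed diagnostic information",
    "info": "confirmation that things are working as expected",
    "warning": "something unexpected happened, but the program keeps working",
    "error": "a problem prevented some function from completing",
    "critical": "a serious error; the program may be unable to continue",
}


def is_shown(message_level, chosen_level):
    """True if a message at message_level passes the chosen threshold."""
    return (getattr(logging, message_level.upper())
            >= getattr(logging, chosen_level.upper()))


for name, meaning in LEVELS.items():
    print(f"{name:<9}{getattr(logging, name.upper()):>3}  {meaning}")

print(is_shown("error", "info"))   # errors pass an "info" threshold
print(is_shown("debug", "info"))   # debug messages do not
```

Choosing a level shows messages at that level and every more severe one, so `debug` is the most verbose setting and `critical` the quietest.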

⁉️(-x -s -t -p and -n)  ⛔️⛔️⛔️ What happened to the other options that were 
                        available in the original version of getpapers?⛔️⛔️⛔️

--api <name>              API to search [eupmc, crossref, ieee, arxiv] (default: eupmc)
-x, --xml             download fulltext XMLs if available
-p, --pdf             download fulltext PDFs if available
-s, --supp            download supplementary files if available
-t, --minedterms          download text-mined terms if available

 

Example commands

+----------------------+-------------+-------------------------+------------+----------------+-----------------+
| Purpose/Category     | Command     | Sub-Command/Program     | Option     | Sub-Option     | Description     |
+----------------------+-------------+-------------------------+------------+----------------+-----------------+
| Run Program          | docanalysis |                         |            |                |                 |
+----------------------+-------------+-------------------------+------------+----------------+-----------------+
| Run Sub-Program      |             | --run_pygetpapers       |            |                |                 |
+----------------------+-------------+-------------------------+------------+----------------+-----------------+

 

Downloading articles from [EUPMC](https://europepmc.org/)

In the example below, we build a docanalysis “command” to perform a simple task. (Note: For help building more advanced search queries, see this [EuropePMC Search syntax reference](https://europepmc.org/searchsyntax).)

We begin our command with "docanalysis" (to launch our program), followed by the sub-command "--run_pygetpapers" (to invoke pygetpapers, the docanalysis sub-program that downloads papers from EUPMC), followed by the argument option "-q", which precedes our search term(s) enclosed in quotation marks ("terpene"). To specify how many papers we want to download, we use the argument option "-k" followed by the number of papers we desire (in this case, 10). Finally, we use the argument option "--project_name" followed by the name we have chosen for our project's directory/folder (in this case, "terpene_10"). (See example below.)

 

Example

We want to use docanalysis to run pygetpapers to search for papers containing the term “terpene” and then download 10 of them into a directory named “terpene_10”.

 

COMMAND (Input)

docanalysis --run_pygetpapers -q "terpene" -k 10 --project_name terpene_10
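
For readers who would rather assemble this command from a script, the sketch below builds the same argument list in Python. It uses only the flags shown in the command above; the helper name `build_download_command` and the `RUN` switch are hypothetical, and the multi-word query is only an illustration of quoting.

```python
import shlex
import shutil
import subprocess


def build_download_command(query, hits, project_name):
    """Assemble the download command shown above as a list of arguments."""
    return ["docanalysis", "--run_pygetpapers",
            "-q", query,
            "-k", str(hits),
            "--project_name", project_name]


command = build_download_command("terpene", 10, "terpene_10")
print(shlex.join(command))

# A query with spaces is quoted automatically (POSIX shell style).
print(shlex.join(build_download_command("terpene AND limonene", 5, "demo")))

# Set RUN to True to execute; this needs docanalysis installed and on PATH.
RUN = False
if RUN and shutil.which("docanalysis"):
    subprocess.run(command, check=True)
```

Passing the arguments as a list means Python hands each one to the program unchanged, so the quotation marks are needed only when the command is typed into a terminal.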

 

Running this command will display this output in our terminal window:

 

LOGS (Displayed Output)

⛔️Somewhere (preferably following the log output itself), we should include a key to decipher the log output⛔️

INFO: making project/searching terpene for 10 hits into C:\Users\MY_COMPUTER\docanalysis\terpene_10
INFO: Total Hits are 13935
1it [00:00, 936.44it/s]
INFO: Saving XML files to C:\Users\MY_COMPUTER\docanalysis\terpene_10\*\fulltext.xml
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:30<00:00,  3.10s/it]

… and export the downloaded files into sub-folders (named by their PMC identification numbers) inside the directory we named “terpene_10” on our machine:

 

CPROJ (Downloaded output)

C:\USERS\MY_COMPUTER\DOCANALYSIS\TERPENE_10
│   eupmc_results.json
│
├───PMC8625850
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8727598
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8747377
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8771452
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8775117
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8801761
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8831285
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8839294
│       eupmc_result.json
│       fulltext.xml
│
├───PMC8840323
│       eupmc_result.json
│       fulltext.xml
│
└───PMC8879232
        eupmc_result.json
        fulltext.xml
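
To confirm that a download produced the layout shown above, a short script can list the paper folders. This is a minimal sketch using Python's standard `pathlib`; the helper name `list_papers` is hypothetical, and the demo builds a throwaway folder with two of the PMC names from the tree rather than touching a real project.

```python
import tempfile
from pathlib import Path


def list_papers(project_dir):
    """Return names of sub-folders that contain a fulltext.xml file."""
    project = Path(project_dir)
    return sorted(folder.name for folder in project.iterdir()
                  if folder.is_dir() and (folder / "fulltext.xml").is_file())


# Demo on a temporary folder shaped like the tree above.
with tempfile.TemporaryDirectory() as tmp:
    for pmcid in ("PMC8625850", "PMC8727598"):
        folder = Path(tmp) / pmcid
        folder.mkdir()
        (folder / "fulltext.xml").write_text("<article/>")
    (Path(tmp) / "eupmc_results.json").write_text("{}")
    found = list_papers(tmp)

print(found)
```

Pointing `list_papers` at the real project folder (for example `terpene_10`) should return one name per downloaded paper.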

 

Section the papers

⛔️⛔️⛔️Why and when do we want to do this??⛔️⛔️⛔️

 

COMMAND

docanalysis --project_name terpene_10 --make_section

 

LOGS

WARNING: Making sections in /content/terpene_10/PMC9095633/fulltext.xml
INFO: dict_keys: dict_keys(['abstract', 'acknowledge', 'affiliation', 'author', 'conclusion', 'discussion', 'ethics', 'fig_caption', 'front', 'introduction', 'jrnl_title', 'keyword', 'method', 'octree', 'pdfimage', 'pub_date', 'publisher', 'reference', 'results_discuss', 'search_results', 'sections', 'svg', 'table', 'title'])
WARNING: loading templates.json
INFO: wrote XML sections for /content/terpene_10/PMC9095633/fulltext.xml /content/terpene_10/PMC9095633/sections
WARNING: Making sections in /content/terpene_10/PMC9120863/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC9120863/fulltext.xml /content/terpene_10/PMC9120863/sections
WARNING: Making sections in /content/terpene_10/PMC8982386/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC8982386/fulltext.xml /content/terpene_10/PMC8982386/sections
WARNING: Making sections in /content/terpene_10/PMC9069239/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC9069239/fulltext.xml /content/terpene_10/PMC9069239/sections
WARNING: Making sections in /content/terpene_10/PMC9165828/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC9165828/fulltext.xml /content/terpene_10/PMC9165828/sections
WARNING: Making sections in /content/terpene_10/PMC9119530/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC9119530/fulltext.xml /content/terpene_10/PMC9119530/sections
WARNING: Making sections in /content/terpene_10/PMC8982077/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC8982077/fulltext.xml /content/terpene_10/PMC8982077/sections
WARNING: Making sections in /content/terpene_10/PMC9067962/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC9067962/fulltext.xml /content/terpene_10/PMC9067962/sections
WARNING: Making sections in /content/terpene_10/PMC9154778/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC9154778/fulltext.xml /content/terpene_10/PMC9154778/sections
WARNING: Making sections in /content/terpene_10/PMC9164016/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC9164016/fulltext.xml /content/terpene_10/PMC9164016/sections

⛔️⛔️⛔️Can we <SNIP> this with an explanation? We're going to have to explain this to the user, preferably at the bottom of this log⛔️⛔️⛔️

 47% 1056/2258 [00:01<00:01, 1003.31it/s]ERROR: cannot parse /content/terpene_10/PMC9165828/sections/1_front/1_article-meta/26_custom-meta-group/0_custom-meta/1_meta-value/0_xref.xml
 67% 1516/2258 [00:01<00:00, 1047.68it/s]ERROR: cannot parse /content/terpene_10/PMC9119530/sections/1_front/1_article-meta/24_custom-meta-group/0_custom-meta/1_meta-value/7_xref.xml
ERROR: cannot parse /content/terpene_10/PMC9119530/sections/1_front/1_article-meta/24_custom-meta-group/0_custom-meta/1_meta-value/14_email.xml
ERROR: cannot parse /content/terpene_10/PMC9119530/sections/1_front/1_article-meta/24_custom-meta-group/0_custom-meta/1_meta-value/3_xref.xml
ERROR: cannot parse /content/terpene_10/PMC9119530/sections/1_front/1_article-meta/24_custom-meta-group/0_custom-meta/1_meta-value/6_xref.xml
ERROR: cannot parse /content/terpene_10/PMC9119530/sections/1_front/1_article-meta/24_custom-meta-group/0_custom-meta/1_meta-value/9_email.xml
ERROR: cannot parse /content/terpene_10/PMC9119530/sections/1_front/1_article-meta/24_custom-meta-group/0_custom-meta/1_meta-value/10_email.xml
ERROR: cannot parse /content/terpene_10/PMC9119530/sections/1_front/1_article-meta/24_custom-meta-group/0_custom-meta/1_meta-value/4_xref.xml
...
⛔️⛔️⛔️We're going to have to explain log warnings, errors, etc., to the user —  preferably at the bottom of this log⛔️⛔️⛔️
100% 2258/2258 [00:02<00:00, 949.43it/s] 
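
The log above mixes three message levels (INFO, WARNING, ERROR) with progress-bar output. As a rough aid for reading such a log, the sketch below tallies the messages by level. It assumes the log has been saved as plain text and that every message starts with one of those three level names followed by a colon, as in the lines shown. The helper name `count_log_levels` is hypothetical and is not part of docanalysis.

```python
import re
from collections import Counter

# Matches a level name followed by ": ", even when a progress bar
# precedes it on the same line (as in the ERROR lines above).
LEVEL_PATTERN = re.compile(r"\b(INFO|WARNING|ERROR): ")


def count_log_levels(log_text):
    """Count INFO/WARNING/ERROR messages in a saved log (hypothetical helper)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LEVEL_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample lines copied from the log on this page.
sample = """\
WARNING: Making sections in /content/terpene_10/PMC9164016/fulltext.xml
INFO: wrote XML sections for /content/terpene_10/PMC9164016/fulltext.xml /content/terpene_10/PMC9164016/sections
 47% 1056/2258 [00:01<00:01, 1003.31it/s]ERROR: cannot parse /content/terpene_10/PMC9165828/sections/1_front/1_article-meta/26_custom-meta-group/0_custom-meta/1_meta-value/0_xref.xml
100% 2258/2258 [00:02<00:00, 949.43it/s]
"""

print(count_log_levels(sample))
```

A tally like this only shows how many messages of each kind were logged; it does not say whether an individual error matters for your results.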

 

CTREE of sectioned papers (Visualisation of folders, sub-folders, and files created/saved in the specified PROJECT_NAME folder.)

⛔️Is this actually shown in the log display, or is it a representation? Can we use a screenshot instead? Shouldn’t we start from the PROJECT_NAME folder and after the first or second PMC folder?⛔️

├───PMC8625850
│   └───sections
│       ├───0_processing-meta
│       ├───1_front
│       │   ├───0_journal-meta
│       │   └───1_article-meta
│       ├───2_body
│       │   ├───0_1._introduction
│       │   ├───1_2._materials_and_methods
│       │   │   ├───1_2.1._materials
│       │   │   ├───2_2.2._bacterial_strains
│       │   │   ├───3_2.3._preparation_and_character
│       │   │   ├───4_2.4._evaluation_of_the_effect_
│       │   │   ├───5_2.5._time-kill_studies
│       │   │   ├───6_2.6._propidium_iodide_uptake-e
│       │   │   └───7_2.7._hemolysis_test_from_human
│       │   ├───2_3._results
│       │   │   ├───1_3.1._encapsulation_of_terpene_
│       │   │   ├───2_3.2._both_terpene_alcohol-load
│       │   │   ├───3_3.3._farnesol_and_geraniol-loa
│       │   │   └───4_3.4._farnesol_and_geraniol-loa
│       │   ├───3_4._discussion
│       │   ├───4_5._conclusions
│       │   └───5_6._patents
│       ├───3_back
│       │   ├───0_ack⛔️rename for clarity?⛔️
│       │   ├───1_fn-group⛔️rename for clarity?⛔️
│       │   │   └───0_fn⛔️rename for clarity?⛔️
│       │   ├───2_app-group
│       │   │   └───0_app
│       │   │       └───2_supplementary-material
│       │   │           └───0_media
│       │   └───9_ref-list
│       └───4_floats-group
│           ├───4_table-wrap⛔️rename for clarity?⛔️
│           ├───5_table-wrap⛔️rename for clarity?⛔️
│           ├───6_table-wrap⛔️rename for clarity?⛔️
│           │   └───4_table-wrap-foot⛔️rename for clarity?⛔️
│           │       └───0_fn⛔️rename for clarity?⛔️
│           ├───7_table-wrap⛔️rename for clarity?⛔️
│           └───8_table-wrap⛔️rename for clarity?⛔️
...
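
If you would like to produce a similar folder view on your own machine, the sketch below walks a project folder and builds the same kind of tree. It assumes the layout shown above (PROJECT_NAME, then one folder per paper, then `sections`). The helper name `ctree_lines` and the `max_depth` parameter are hypothetical and are not part of docanalysis.

```python
from pathlib import Path


def ctree_lines(folder, prefix="", max_depth=3):
    """Return tree-style lines for the sub-folders of `folder` (hypothetical helper)."""
    if max_depth == 0:
        return []
    lines = []
    # Alphabetical order, so a folder named "10_..." sorts before "2_...".
    subfolders = sorted(p for p in Path(folder).iterdir() if p.is_dir())
    for index, sub in enumerate(subfolders):
        last = index == len(subfolders) - 1
        lines.append(f"{prefix}{'└───' if last else '├───'}{sub.name}")
        # Keep the vertical guide only while siblings remain below.
        child_prefix = prefix + ("    " if last else "│   ")
        lines.extend(ctree_lines(sub, child_prefix, max_depth - 1))
    return lines


if __name__ == "__main__":
    # "terpene_10" is the PROJECT_NAME used in the examples on this page.
    project = Path("terpene_10")
    if project.is_dir():
        print("\n".join(ctree_lines(project)))
```

Only folders are listed, which matches the view above; raise `max_depth` to see deeper sub-sections.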

 

Search sections using a dictionary

In ami's terminology, a “dictionary” is a set of terms/phrases in XML format.

Dictionaries related to ethics and acknowledgments are available in the [Ethics Dictionary](https://github.com/petermr/docanalysis/tree/main/ethics_dictionary) folder.

If you'd like to create a custom dictionary, you can find the steps [here].

 

Example

COMMAND

docanalysis --project_name terpene_10 --output entities.csv --make_ami_dict entities.xml

 

LOGS

INFO: Found 7134 sentences in the section(s).
INFO: getting terms from /content/activity.xml
100% 7134/7134 [00:02<00:00, 3172.14it/s]
/usr/local/lib/python3.7/dist-packages/docanalysis/entity_extraction.py:352: 
⛔️FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.⛔️
  "[", "").str.replace("]", "")
INFO: wrote output to /content/terpene_10/activity.csv

 

Extract Named Entities

The argument options --spacy_model spacy --entities invoke spaCy (a free, open-source Natural Language Processing library included in docanalysis) to [extract Named Entities](https://spacy.io/) from the corpus of text downloaded when we use pygetpapers via docanalysis (docanalysis --run_pygetpapers).

 

Below is the list of Named Entities supported by spacy:

⛔️This information is duplicated in the end credits.

⛔️Suggest we use the full, spelled-out terms for entities, rather than use contractions.

⛔️Are all of these entities found, or can individual ones be selected via flag options?

 

+-------------------+------------------------------------------+------------------------------------------+
| Named Entity      | Description                              | Examples                                 |
+-------------------+------------------------------------------+------------------------------------------+
| CARDINAL          | Numerals that do not fall under another  | 2, Two, Fifty-two                        |
|                   | type                                     |                                          |
+-------------------+------------------------------------------+------------------------------------------+
| DATE              | Absolute or relative dates or periods    | 9th May 1987, 4 AUG                      |
+-------------------+------------------------------------------+------------------------------------------+
| EVENT             | Named hurricanes, battles, wars, sports  | Olympic Games                            |
|                   | events, etc.                             |                                          |
+-------------------+------------------------------------------+------------------------------------------+
| FAC               | FACILITY: Buildings, airports,           | Logan International Airport, The Golden  |
|                   | highways, bridges, etc.                  | Gate                                     |
+-------------------+------------------------------------------+------------------------------------------+
| GPE               | GEO-POLITICAL ENTITIES: Countries,       | India, Australia, South East Asia        |
|                   | cities, states                           |                                          |
+-------------------+------------------------------------------+------------------------------------------+
| LANGUAGE          | Any named language                       | English, Portuguese, French              |
+-------------------+------------------------------------------+------------------------------------------+
| LAW               | Named documents made into laws           | Roe v. Wade                              |
+-------------------+------------------------------------------+------------------------------------------+
| LOC               | LOCATION: Non-GPE locations,             | Mount Everest, River Ganga               |
|                   | mountain ranges, bodies of water         |                                          |
+-------------------+------------------------------------------+------------------------------------------+
| MONEY             | Monetary values, including unit          | million dollars, INR 4 Crore             |
+-------------------+------------------------------------------+------------------------------------------+
| NORP              | Nationalities or religious or political  | The Republican Party                     |
|                   | groups                                   |                                          |
+-------------------+------------------------------------------+------------------------------------------+
| ORDINAL           | first, second, etc.                      | 9th, Ninth                               |
+-------------------+------------------------------------------+------------------------------------------+
| ORG               | Companies, agencies, institutions, etc.  | Microsoft, Facebook, FBI, MIT            |
+-------------------+------------------------------------------+------------------------------------------+
| PERCENT           | Percentage, including "%"                | Eighty percent                           |
+-------------------+------------------------------------------+------------------------------------------+
| PERSON            | People, including fictional              | Bill Clinton, Fred Flintstone            |
+-------------------+------------------------------------------+------------------------------------------+
| PRODUCT           | Objects, vehicles, foods, etc. (Not      | Formula 1                                |
|                   | services.)                               |                                          |
+-------------------+------------------------------------------+------------------------------------------+
| QUANTITY          | Measurements, as of weight or distance   | Several kilometers, 55kg                 |
+-------------------+------------------------------------------+------------------------------------------+
| TIME              | Times smaller than a day                 | 7:23 A.M., three-forty am, Four hours    |
+-------------------+------------------------------------------+------------------------------------------+
| WORK_OF_ART       | Titles of books, songs, etc.             | The Mona Lisa                            |
+-------------------+------------------------------------------+------------------------------------------+

 

Example

COMMAND

docanalysis --project_name terpene_10 --make_section --spacy_model spacy --entities ORG --output org.csv

 

LOGS

INFO: Found 7134 sentences in the section(s).
INFO: Loading spacy
100% 7134/7134 [01:08<00:00, 104.16it/s]
/usr/local/lib/python3.7/dist-packages/docanalysis/entity_extraction.py:352: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
  "[", "").str.replace("]", "")
INFO: wrote output to /content/terpene_10/org.csv

 

Extract information from specific section(s)

You can restrict entity extraction to specific sections of the papers with the --search_section option (AUT and AFF in the example below).

 

Example

COMMAND

docanalysis --project_name terpene_10 --make_section --spacy_model spacy --search_section AUT, AFF --entities ORG --output org_aut_aff.csv

 

LOG

INFO: Found 28 sentences in the section(s).
INFO: Loading spacy
100% 28/28 [00:00<00:00, 106.66it/s]
/usr/local/lib/python3.7/dist-packages/docanalysis/entity_extraction.py:352: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
  "[", "").str.replace("]", "")
INFO: wrote output to /content/terpene_10/org_aut_aff.csv

 

Create dictionary of extracted entities

 

COMMAND

docanalysis --project_name terpene_10 --make_section --spacy_model spacy --search_section AUT, AFF --entities ORG --output org_aut_aff.csvv --make_ami_dict org

 

LOG

INFO: Found 28 sentences in the section(s).
INFO: Loading spacy
100% 28/28 [00:00<00:00, 96.56it/s] 
/usr/local/lib/python3.7/dist-packages/docanalysis/entity_extraction.py:352: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
  "[", "").str.replace("]", "")
INFO: wrote output to /content/terpene_10/org_aut_aff.csvv
INFO: Wrote all the entities extracted to ami dict

 

Snippet of the dictionary

<?xml version="1.0"?>
<dictionary title="/content/terpene_10/org.xml">
<entry count="2" term="Department of Biochemistry"/>
<entry count="2" term="Chinese Academy of Agricultural Sciences"/>
<entry count="2" term="Tianjin University"/>
<entry count="2" term="Desert Research Center"/>
<entry count="2" term="Chinese Academy of Sciences"/>
<entry count="2" term="University of Colorado Boulder"/>
<entry count="2" term="Department of Neurology"/>
<entry count="1" term="Max Planck Institute for Chemical Ecology"/>
<entry count="1" term="College of Forest Resources and Environmental Science"/>
<entry count="1" term="Michigan Technological University"/>

https://github.com/petermr/docanalysis/blob/main/README.md#what-is-a-dictionary
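
To reuse the terms in such a dictionary outside docanalysis, the entries can be read with Python's standard library. The sketch below is a minimal illustration, assuming the dictionary uses `entry` elements with `term` and `count` attributes as in the snippet, and that a complete file ends with a closing `</dictionary>` tag (the snippet is truncated, so that tag is an assumption). The helper name `read_terms` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Two entries copied from the snippet above; the closing tag is an
# assumption, since the snippet is truncated.
DICTIONARY_XML = """<?xml version="1.0"?>
<dictionary title="/content/terpene_10/org.xml">
<entry count="2" term="Tianjin University"/>
<entry count="1" term="Michigan Technological University"/>
</dictionary>
"""


def read_terms(xml_text):
    """Return (term, count) pairs from an ami dictionary string (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [(e.get("term"), int(e.get("count", "0"))) for e in root.iter("entry")]


for term, count in read_terms(DICTIONARY_XML):
    print(f"{count}\t{term}")
```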

 

All at one go!

docanalysis --run_pygetpapers -q "terpene" -k 10 --project_name terpene_10 --make_section --output entities_20220209.csv --make_ami_dict entities_20220209.xml
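
If you would rather launch the one-step command from a Python script, the sketch below assembles the same flags into an argument list. It uses only the flags shown in the command above. The helper name `build_command`, its parameter names, and the file names are illustrative assumptions, and the script only prints the command rather than running it.

```python
import shlex


def build_command(query, hits, project_name, output_csv, dict_xml):
    """Assemble the one-step docanalysis command as an argument list (hypothetical helper)."""
    return [
        "docanalysis",
        "--run_pygetpapers",
        "-q", query,
        "-k", str(hits),
        "--project_name", project_name,
        "--make_section",
        "--output", output_csv,
        "--make_ami_dict", dict_xml,
    ]


command = build_command("terpene", 10, "terpene_10",
                        "entities_20220209.csv", "entities_20220209.xml")
print(shlex.join(command))
# To run it, pass the list to subprocess.run(command, check=True)
# on a machine where docanalysis is installed.
```

Passing the arguments as a list means a query containing spaces needs no extra quoting.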

 

credits:

developers

 

special thanks

  • if any

 

technologies

docanalysis integrates and leverages the power of the following open-source technologies:

[pygetpapers](https://github.com/petermr/pygetpapers): searches for and downloads papers from [europepmc.org (“EUPMC”)](https://www.europepmc.org) (.html, .xml, .pdf, and/or .json)

[NLTK](https://www.nltk.org/) and other Python tools: used for many operations

[spaCy](https://spacy.io/) or [scispaCy](https://allenai.github.io/scispacy/): used for extraction and annotation of entities

docanalysis itself ingests [CProjects](https://github.com/petermr/tigr2ess/blob/master/getpapers/TUTORIAL.md#cproject-and-ctrees) and carries out text-analysis of documents, including sectioning, NLP/text-mining, and vocabulary generation. It outputs summary data and word-dictionaries.
