
Reports for testers

tlhahsn edited this page Apr 8, 2021 · 15 revisions

Format: Name of the tester, Test Performed, Result (if there is an error, open a GitHub issue)

Tester 1: Radhu Ladani

OS: Windows 10

Date: 7th April 2021

Running pygetpapers in commandline

  • Check the installation of pygetpapers and its prerequisites for the set-up.
  • This command installs the updated version of pygetpapers: pip install git+git://github.com/petermr/pygetpapers
C:\Users\DELL>pip3 install git+git://github.com/petermr/pygetpapers
Collecting git+git://github.com/petermr/pygetpapers
  Cloning git://github.com/petermr/pygetpapers to c:\users\dell\appdata\local\temp\pip-req-build-6l3rldns
  Running command git clone -q git://github.com/petermr/pygetpapers 'C:\Users\DELL\AppData\Local\Temp\pip-req-build-6l3rldns'
Requirement already satisfied: requests in c:\users\dell\appdata\local\programs\python\python39\lib\site-packages (from pygetpapers==0.0.3.1) (2.20.0)
Requirement already satisfied: pandas_read_xml in c:\users\dell\appdata\local\programs\python\python39\lib\site-packages (from pygetpapers==0.0.3.1) (0.0.9)
Requirement already satisfied: pandas in c:\users\dell\appdata\local\programs\python\python39\lib\site-packages (from pygetpapers==0.0.3.1) (1.2.0)
Requirement already satisfied: lxml in c:\users\dell\appdata\local\programs\python\python39\lib\site-packages (from pygetpapers==0.0.3.1) (4.6.2)
Requirement already satisfied: chromedriver_autoinstaller in c:\users\dell\appdata\local\programs\python\python39\lib\site-packages (from pygetpapers==0.0.3.1) (0.2.2)
Requirement already satisfied: xmltodict in c:\users\dell\appdata\local\programs\python\python39\lib\site-packages (from pygetpapers==0.0.3.1) (0.12.0)
Requirement already satisfied: selenium in c:\users\dell\appdata\local\programs\python\python39\lib\site-packages (from pygetpapers==0.0.3.1) (3.12.0)
Requirement already satisfied: numpy>=1.16.5 in c:\users\dell\appdata\local\programs\python\python39\lib\site-packages (from pandas->pygetpapers==0.0.3.1) (1.20.1)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\dell\appdata\local\programs\python\python39\lib\site-packages (from pandas->pygetpapers==0.0.3.1) (2.8.1)
Requirement already satisfied: pytz>=2017.3 in c:\users\dell\appdata\local\programs\python\python39\lib\site-packages (from pandas->pygetpapers==0.0.3.1) (2021.1)
Requirement already satisfied: six>=1.5 in c:\users\dell\appdata\local\programs\python\python39\lib\site-packages (from python-dateutil>=2.7.3->pandas->pygetpapers==0.0.3.1) (1.15.0)
Requirement already satisfied: zipfile36 in c:\users\dell\appdata\local\programs\python\python39\lib\site-packages (from pandas_read_xml->pygetpapers==0.0.3.1) (0.1.3)
Requirement already satisfied: distlib in c:\users\dell\appdata\local\programs\python\python39\lib\site-packages (from pandas_read_xml->pygetpapers==0.0.3.1) (0.3.1)
Requirement already satisfied: pyarrow in c:\users\dell\appdata\local\programs\python\python39\lib\site-packages (from pandas_read_xml->pygetpapers==0.0.3.1) (3.0.0)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in c:\users\dell\appdata\local\programs\python\python39\lib\site-packages (from requests->pygetpapers==0.0.3.1) (3.0.4)
Requirement already satisfied: idna<2.8,>=2.5 in c:\users\dell\appdata\local\programs\python\python39\lib\site-packages (from requests->pygetpapers==0.0.3.1) (2.7)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\dell\appdata\local\programs\python\python39\lib\site-packages (from requests->pygetpapers==0.0.3.1) (2020.12.5)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in c:\users\dell\appdata\local\programs\python\python39\lib\site-packages (from requests->pygetpapers==0.0.3.1) (1.24.3)
Using legacy 'setup.py install' for pygetpapers, since package 'wheel' is not installed.
Installing collected packages: pygetpapers
  Attempting uninstall: pygetpapers
    Found existing installation: pygetpapers 0.0.1
    Uninstalling pygetpapers-0.0.1:
      Successfully uninstalled pygetpapers-0.0.1
    Running setup.py install for pygetpapers ... done
Successfully installed pygetpapers-0.0.3.1
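
The install log above ends with pygetpapers 0.0.3.1 replacing 0.0.1. As a minimal sketch, the installed version can also be confirmed from Python using the standard library's importlib.metadata (Python 3.8+). The helper name installed_version is hypothetical and is not part of pygetpapers.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("pygetpapers")
    # 0.0.3.1 is the version reported in the install log above.
    print("pygetpapers version:", found, "(expected 0.0.3.1)")
```

A result of None means the package is not visible to the Python interpreter running the check, which can happen when pip and python point at different installations.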

Running pygetpapers on the cmd

  • Run pygetpapers --help
Output
C:\Users\DELL>pygetpapers --help
usage: pygetpapers [-h] [-v] [-q QUERY] [-o OUTPUT] [-x] [-p] [-s] [--references REFERENCES] [-n]
                   [--citations CITATIONS] [-l LOGLEVEL] [-f LOGFILE] [-k LIMIT] [-r RESTART] [-u UPDATE]
                   [--onlyquery] [-c] [--synonym]

Welcome to Pygetpapers version 0.0.3.1. -h or --help for help

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         output the version number
  -q QUERY, --query QUERY
                        query string transmitted to repository API. Eg. 'Artificial Intelligence' or 'Plant Parts'. To
                        escape special characters within the quotes, use backslash. The query to be quoted in either
                        single or double quotes.
  -o OUTPUT, --output OUTPUT
                        output directory (Default: current working directory)
  -x, --xml             download fulltext XMLs if available
  -p, --pdf             download fulltext PDFs if available
  -s, --supp            download supplementary files if available
  --references REFERENCES
                        Download references if available. Requires source for references
                        (AGR,CBA,CTX,ETH,HIR,MED,PAT,PMC,PPR).
  -n, --noexecute       report how many results match the query, but don't actually download anything
  --citations CITATIONS
                        Download citations if available. Requires source for citations
                        (AGR,CBA,CTX,ETH,HIR,MED,PAT,PMC,PPR).
  -l LOGLEVEL, --loglevel LOGLEVEL
                        Provide logging level. Example --log warning <<info,warning,debug,error,critical>>,
                        default='info'
  -f LOGFILE, --logfile LOGFILE
                        save log to specified file in output directory as well as printing to terminal
  -k LIMIT, --limit LIMIT
                        maximum number of hits (default: 100)
  -r RESTART, --restart RESTART
                        Reads the json and makes the xml files. Takes the path to the json as the input
  -u UPDATE, --update UPDATE
                        Updates the corpus by downloading new papers. Takes the path of metadata json file of the
                        orignal corpus as the input. Requires -k or --limit (If not provided, default will be used)
                        and -q or --query (must be provided) to be given. Takes the path to the json as the input.
  --onlyquery           Saves json file containing the result of the query in storage. The json file can be given to
                        --restart to download the papers later.
  -c, --makecsv         Stores the per-document metadata as csv. Works only with --api method.
  --synonym             Results contain synonyms as well.
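
The options in the help text above can be combined into one command. The following is a rough sketch of assembling such a command as an argument list in Python. build_command and its parameter names are hypothetical helpers, not part of pygetpapers, and only flags that appear in the help output are used.

```python
def build_command(query, limit=None, output=None, xml=False, pdf=False,
                  supp=False, makecsv=False, noexecute=False, loglevel=None):
    """Assemble a pygetpapers argument list from flags shown in --help.

    Hypothetical helper: it only builds the list, it does not run anything.
    """
    args = ["pygetpapers", "-q", query]
    if limit is not None:
        args += ["-k", str(limit)]      # maximum number of hits
    if output is not None:
        args += ["-o", output]          # output directory
    if xml:
        args.append("-x")               # fulltext XMLs
    if pdf:
        args.append("-p")               # fulltext PDFs
    if supp:
        args.append("-s")               # supplementary files
    if makecsv:
        args.append("-c")               # per-document metadata as csv
    if noexecute:
        args.append("-n")               # only report the number of hits
    if loglevel is not None:
        args += ["-l", loglevel]        # logging level
    return args


if __name__ == "__main__":
    print(build_command("Plant genes", noexecute=True))
```

The list can be passed to subprocess.run(args, check=True). Passing a list rather than a single string avoids quoting differences between shells for queries that contain spaces.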

Test Performed

  • Example query: pygetpapers -q "Medicinal Activity" -k 10 -o "output" -x -p -c -s
  • In this command: -x (--xml) downloads fulltext XMLs if available, -p (--pdf) downloads fulltext PDFs if available, -s (--supp) downloads supplementary files if available, and -c (--makecsv) stores the per-document metadata as csv (works only with the --api method).
  • The command created an "output" folder in the current directory. Inside it, each downloaded paper (limited in number by -k) has its own folder named after its PMC ID (e.g. PMC7751408).
  • Each PMC ID folder contains the eupmc_result JSON file, the fulltext csv, pdf and xml, and the supplementary files.
  • In addition, a .csv file with the PMC ID, HTML link, keywords, pdf link, journal title and author info was created.
  • Example query 2: pygetpapers -q "Medicinal Activity" -k 10 -o "out_test" -x -p -c -s -l "info"
    • In this command, -l (--loglevel) sets the logging level: info, warning, debug, error or critical.
C:\Users\DELL>pygetpapers -q "Medicinal Activity" -k 10 -o "out_test" -x -p -c -s -l "info"
INFO: Total Hits are 206841
INFO: Saving XML files to C:\Users\DELL\out_test\*\fulltext.xml
INFO: Made Supplementary files for PMC7822064
INFO: */Wrote xml for PMC7822064/
INFO: Wrote the pdf file for PMC7822064
INFO: Made Supplementary files for PMC7993383
INFO: */Wrote xml for PMC7993383/
INFO: Wrote the pdf file for PMC7993383
INFO: Made Supplementary files for PMC7939573
INFO: */Wrote xml for PMC7939573/
INFO: Wrote the pdf file for PMC7939573
INFO: Made Supplementary files for PMC7833026
INFO: */Wrote xml for PMC7833026/
INFO: Wrote the pdf file for PMC7833026
INFO: Made Supplementary files for PMC7808749
INFO: */Wrote xml for PMC7808749/
INFO: Wrote the pdf file for PMC7808749
INFO: Made Supplementary files for PMC7751408
INFO: */Wrote xml for PMC7751408/
INFO: Wrote the pdf file for PMC7751408
INFO: Made Supplementary files for PMC7889190
INFO: */Wrote xml for PMC7889190/
INFO: Wrote the pdf file for PMC7889190
INFO: Made Supplementary files for PMC7850424
INFO: */Wrote xml for PMC7850424/
INFO: Wrote the pdf file for PMC7850424
INFO: Made Supplementary files for PMC7782983
INFO: */Wrote xml for PMC7782983/
INFO: Wrote the pdf file for PMC7782983
INFO: Made Supplementary files for PMC7782162
INFO: */Wrote xml for PMC7782162/
INFO: Wrote the pdf file for PMC7782162

Tester 2: Kanishka Parashar

OS: Windows 10

Date: 7th April 2021

Installation of pygetpapers

Run this command in cmd: pip install git+git://github.com/petermr/pygetpapers
Collecting git+git://github.com/petermr/pygetpapers
  Cloning git://github.com/petermr/pygetpapers to c:\users\hp pc\appdata\local\temp\pip-req-build-id5w7re6
Requirement already satisfied: requests in c:\users\hp pc\appdata\local\programs\python\python39\lib\site-packages (from pygetpapers==0.0.3.1) (2.20.0)
Requirement already satisfied: pandas_read_xml in c:\users\hp pc\appdata\local\programs\python\python39\lib\site-packages (from pygetpapers==0.0.3.1) (0.0.9)
Requirement already satisfied: pandas in c:\users\hp pc\appdata\local\programs\python\python39\lib\site-packages (from pygetpapers==0.0.3.1) (1.2.0)
Requirement already satisfied: lxml in c:\users\hp pc\appdata\local\programs\python\python39\lib\site-packages (from pygetpapers==0.0.3.1) (4.6.2)
Requirement already satisfied: chromedriver_autoinstaller in c:\users\hp pc\appdata\local\programs\python\python39\lib\site-packages (from pygetpapers==0.0.3.1) (0.2.2)
Requirement already satisfied: xmltodict in c:\users\hp pc\appdata\local\programs\python\python39\lib\site-packages (from pygetpapers==0.0.3.1) (0.12.0)
Requirement already satisfied: selenium in c:\users\hp pc\appdata\local\programs\python\python39\lib\site-packages (from pygetpapers==0.0.3.1) (3.12.0)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\hp pc\appdata\local\programs\python\python39\lib\site-packages (from requests->pygetpapers==0.0.3.1) (2020.12.5)
Requirement already satisfied: idna<2.8,>=2.5 in c:\users\hp pc\appdata\local\programs\python\python39\lib\site-packages (from requests->pygetpapers==0.0.3.1) (2.7)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in c:\users\hp pc\appdata\local\programs\python\python39\lib\site-packages (from requests->pygetpapers==0.0.3.1) (3.0.4)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in c:\users\hp pc\appdata\local\programs\python\python39\lib\site-packages (from requests->pygetpapers==0.0.3.1) (1.24.3)
Requirement already satisfied: zipfile36 in c:\users\hp pc\appdata\local\programs\python\python39\lib\site-packages (from pandas_read_xml->pygetpapers==0.0.3.1) (0.1.3)
Requirement already satisfied: pyarrow in c:\users\hp pc\appdata\local\programs\python\python39\lib\site-packages (from pandas_read_xml->pygetpapers==0.0.3.1) (3.0.0)
Requirement already satisfied: distlib in c:\users\hp pc\appdata\local\programs\python\python39\lib\site-packages (from pandas_read_xml->pygetpapers==0.0.3.1) (0.3.1)
Requirement already satisfied: numpy>=1.16.5 in c:\users\hp pc\appdata\local\programs\python\python39\lib\site-packages (from pandas->pygetpapers==0.0.3.1) (1.20.1)
Requirement already satisfied: pytz>=2017.3 in c:\users\hp pc\appdata\local\programs\python\python39\lib\site-packages (from pandas->pygetpapers==0.0.3.1) (2021.1)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\hp pc\appdata\local\programs\python\python39\lib\site-packages (from pandas->pygetpapers==0.0.3.1) (2.8.1)
Requirement already satisfied: six>=1.5 in c:\users\hp pc\appdata\local\programs\python\python39\lib\site-packages (from python-dateutil>=2.7.3->pandas->pygetpapers==0.0.3.1) (1.15.0)
Using legacy 'setup.py install' for pygetpapers, since package 'wheel' is not installed.
Installing collected packages: pygetpapers
  Attempting uninstall: pygetpapers
    Found existing installation: pygetpapers 0.0.1
    Uninstalling pygetpapers-0.0.1:
      Successfully uninstalled pygetpapers-0.0.1
    Running setup.py install for pygetpapers ... done
Successfully installed pygetpapers-0.0.3.1
WARNING: You are using pip version 20.2.3; however, version 21.0.1 is available.
You should consider upgrading via the 'c:\users\hp pc\appdata\local\programs\python\python39\python.exe -m pip install --upgrade pip' command.

Tester 3: Vasant Kumar

OS: Windows 10

Date: 7th April 2021

Installation of pygetpapers and running query

  • Update your installation by running pip install git+git://github.com/petermr/pygetpapers on your commandline; this installs the new version of pygetpapers.
  • Output
C:\Users\vasan>pip install git+git://github.com/petermr/pygetpapers
Collecting git+git://github.com/petermr/pygetpapers
  Cloning git://github.com/petermr/pygetpapers to c:\users\vasan\appdata\local\temp\pip-req-build-7d0v_az0
Requirement already satisfied: requests in c:\users\vasan\appdata\local\programs\python\python39\lib\site-packages (from pygetpapers==0.0.3.1) (2.25.1)
Requirement already satisfied: pandas_read_xml in c:\users\vasan\appdata\roaming\python\python39\site-packages (from pygetpapers==0.0.3.1) (0.0.9)
Requirement already satisfied: pandas in c:\users\vasan\appdata\roaming\python\python39\site-packages (from pygetpapers==0.0.3.1) (1.2.0)
Requirement already satisfied: lxml in c:\users\vasan\appdata\roaming\python\python39\site-packages (from pygetpapers==0.0.3.1) (4.6.2)
Requirement already satisfied: chromedriver_autoinstaller in c:\users\vasan\appdata\roaming\python\python39\site-packages (from pygetpapers==0.0.3.1) (0.2.2)
Requirement already satisfied: xmltodict in c:\users\vasan\appdata\local\programs\python\python39\lib\site-packages (from pygetpapers==0.0.3.1) (0.12.0)
Requirement already satisfied: selenium in c:\users\vasan\appdata\roaming\python\python39\site-packages (from pygetpapers==0.0.3.1) (3.12.0)
Requirement already satisfied: idna<3,>=2.5 in c:\users\vasan\appdata\local\programs\python\python39\lib\site-packages (from requests->pygetpapers==0.0.3.1) (2.10)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\users\vasan\appdata\local\programs\python\python39\lib\site-packages (from requests->pygetpapers==0.0.3.1) (1.26.4)
Requirement already satisfied: chardet<5,>=3.0.2 in c:\users\vasan\appdata\local\programs\python\python39\lib\site-packages (from requests->pygetpapers==0.0.3.1) (4.0.0)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\vasan\appdata\local\programs\python\python39\lib\site-packages (from requests->pygetpapers==0.0.3.1) (2020.12.5)
Requirement already satisfied: zipfile36 in c:\users\vasan\appdata\local\programs\python\python39\lib\site-packages (from pandas_read_xml->pygetpapers==0.0.3.1) (0.1.3)
Requirement already satisfied: pyarrow in c:\users\vasan\appdata\roaming\python\python39\site-packages (from pandas_read_xml->pygetpapers==0.0.3.1) (3.0.0)
Requirement already satisfied: distlib in c:\users\vasan\appdata\local\programs\python\python39\lib\site-packages (from pandas_read_xml->pygetpapers==0.0.3.1) (0.3.1)
Requirement already satisfied: numpy>=1.16.5 in c:\users\vasan\appdata\local\programs\python\python39\lib\site-packages (from pandas->pygetpapers==0.0.3.1) (1.20.1)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\vasan\appdata\roaming\python\python39\site-packages (from pandas->pygetpapers==0.0.3.1) (2.8.1)
Requirement already satisfied: pytz>=2017.3 in c:\users\vasan\appdata\roaming\python\python39\site-packages (from pandas->pygetpapers==0.0.3.1) (2021.1)
Requirement already satisfied: six>=1.5 in c:\users\vasan\appdata\roaming\python\python39\site-packages (from python-dateutil>=2.7.3->pandas->pygetpapers==0.0.3.1) (1.15.0)
Building wheels for collected packages: pygetpapers
  Building wheel for pygetpapers (setup.py) ... done
  Created wheel for pygetpapers: filename=pygetpapers-0.0.3.1-py2.py3-none-any.whl size=15228 sha256=f33360e30867278ef54c94a6c44b077a552c1f7ae94d33645d9c7cea0335f2ff
  Stored in directory: C:\Users\vasan\AppData\Local\Temp\pip-ephem-wheel-cache-cbi4pcjd\wheels\91\d1\11\341c5b9440e416ab82c2d7b3ce086fb12256db35effd396391
Successfully built pygetpapers
Installing collected packages: pygetpapers
Successfully installed pygetpapers-0.0.3.1

Check your installation by running pygetpapers --help on your commandline.

  • Output
C:\Users\vasan>pygetpapers --help
usage: pygetpapers [-h] [-v] [-q QUERY] [-o OUTPUT] [-x] [-p] [-s] [--references REFERENCES] [-n]
                   [--citations CITATIONS] [-l LOGLEVEL] [-f LOGFILE] [-k LIMIT] [-r RESTART] [-u UPDATE]
                   [--onlyquery] [-c] [--synonym]

Welcome to Pygetpapers version 0.0.3.1. -h or --help for help

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         output the version number
  -q QUERY, --query QUERY
                        query string transmitted to repository API. Eg. 'Artificial Intelligence' or 'Plant Parts'. To
                        escape special characters within the quotes, use backslash. The query to be quoted in either
                        single or double quotes.
  -o OUTPUT, --output OUTPUT
                        output directory (Default: current working directory)
  -x, --xml             download fulltext XMLs if available
  -p, --pdf             download fulltext PDFs if available
  -s, --supp            download supplementary files if available
  --references REFERENCES
                        Download references if available. Requires source for references
                        (AGR,CBA,CTX,ETH,HIR,MED,PAT,PMC,PPR).
  -n, --noexecute       report how many results match the query, but don't actually download anything
  --citations CITATIONS
                        Download citations if available. Requires source for citations
                        (AGR,CBA,CTX,ETH,HIR,MED,PAT,PMC,PPR).
  -l LOGLEVEL, --loglevel LOGLEVEL
                        Provide logging level. Example --log warning <<info,warning,debug,error,critical>>,
                        default='info'
  -f LOGFILE, --logfile LOGFILE
                        save log to specified file in output directory as well as printing to terminal
  -k LIMIT, --limit LIMIT
                        maximum number of hits (default: 100)
  -r RESTART, --restart RESTART
                        Reads the json and makes the xml files. Takes the path to the json as the input
  -u UPDATE, --update UPDATE
                        Updates the corpus by downloading new papers. Takes the path of metadata json file of the
                        orignal corpus as the input. Requires -k or --limit (If not provided, default will be used)
                        and -q or --query (must be provided) to be given. Takes the path to the json as the input.
  --onlyquery           Saves json file containing the result of the query in storage. The json file can be given to
                        --restart to download the papers later.
  -c, --makecsv         Stores the per-document metadata as csv. Works only with --api method.
  --synonym             Results contain synonyms as well.

Test query 1

  • Query command: pygetpapers -q "Plant genes" -o "testingfiles" -s -p -c -x -k 10
  • A folder named "testingfiles" was created, containing 10 papers, each in a folder named after its PMC ID. Each PMC ID folder includes the JSON file, fulltext csv, pdf, xml and supplementary files.
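
The folder layout described above can be checked with a short script rather than by opening each folder. This is a minimal sketch, assuming only what the reports show: one sub-folder per paper whose name starts with PMC, and an XML file named fulltext.xml (the name that appears in the INFO log lines). The function names are hypothetical, and no other file names are assumed.

```python
from pathlib import Path


def summarize_corpus(output_dir):
    """Map each PMC* folder name to the sorted names of the files inside it."""
    root = Path(output_dir)
    summary = {}
    for folder in sorted(root.glob("PMC*")):
        if folder.is_dir():
            summary[folder.name] = sorted(p.name for p in folder.iterdir())
    return summary


def missing_fulltext(summary, filename="fulltext.xml"):
    """Return the PMC IDs whose folder lacks `filename`."""
    return [pmcid for pmcid, files in summary.items() if filename not in files]


if __name__ == "__main__":
    corpus = summarize_corpus("testingfiles")
    print(len(corpus), "paper folders found")
    print("without fulltext.xml:", missing_fulltext(corpus))
```

With -k 10, the first line would be expected to report 10 folders. A different count is worth noting in the test result.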

Test query 2

  • pygetpapers -q "Plant genes" -o "testing_files" -s -p -c -x -k 10 -l "info"
  • -l (--loglevel) sets the logging level: info, warning, debug, error or critical.
  • Output
C:\Users\vasan>pygetpapers -q "Plant genes" -o "testing_files" -s -p -c -x -k 10 -l "info"
INFO: Total Hits are 325273
WARNING: Keywords not found for paper 1
WARNING: Keywords not found for paper 4
WARNING: html url not found for paper 5
WARNING: Keywords not found for paper 5
WARNING: pdf url not found for paper 5
WARNING: Keywords not found for paper 10
INFO: Saving XML files to C:\Users\vasan\testing_files\*\fulltext.xml
INFO: Made Supplementary files for PMC7736860
INFO: */Wrote xml for PMC7736860/
INFO: Wrote the pdf file for PMC7736860
INFO: Made Supplementary files for PMC6874142
INFO: */Wrote xml for PMC6874142/
INFO: Wrote the pdf file for PMC6874142
INFO: Made Supplementary files for PMC7516213
INFO: */Wrote xml for PMC7516213/
INFO: Wrote the pdf file for PMC7516213
INFO: Made Supplementary files for PMC7383801
INFO: */Wrote xml for PMC7383801/
INFO: Wrote the pdf file for PMC7383801
INFO: Made Supplementary files for PMC7001462
INFO: */Wrote xml for PMC7001462/
INFO: Made Supplementary files for PMC6777021
INFO: */Wrote xml for PMC6777021/
INFO: Wrote the pdf file for PMC6777021
INFO: Made Supplementary files for PMC6296014
INFO: */Wrote xml for PMC6296014/
INFO: Wrote the pdf file for PMC6296014
INFO: Made Supplementary files for PMC5664361
INFO: */Wrote xml for PMC5664361/
INFO: Wrote the pdf file for PMC5664361
INFO: Made Supplementary files for PMC5343966
INFO: */Wrote xml for PMC5343966/
INFO: Wrote the pdf file for PMC5343966
INFO: Made Supplementary files for PMC5596367
INFO: */Wrote xml for PMC5596367/
INFO: Wrote the pdf file for PMC5596367
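
In the log above, PMC7001462 has supplementary and xml lines but no "Wrote the pdf file" line, which may correspond to the "pdf url not found for paper 5" warning. A gap like this is easy to miss by eye. The sketch below parses the three INFO message shapes seen in this log and lists the papers missing a given file type. The function names are hypothetical, and the patterns are taken only from the lines shown here, so other pygetpapers versions may word them differently.

```python
import re

# Message shapes copied from the INFO lines in the log above.
PATTERNS = {
    "supp": re.compile(r"Made Supplementary files for (PMC\d+)"),
    "xml": re.compile(r"Wrote xml for (PMC\d+)"),
    "pdf": re.compile(r"Wrote the pdf file for (PMC\d+)"),
}


def parse_log(text):
    """Return {PMC ID: set of file kinds reported as written}, in log order."""
    written = {}
    for line in text.splitlines():
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                written.setdefault(match.group(1), set()).add(kind)
    return written


def missing(written, kind):
    """Return the PMC IDs for which no line of the given kind was logged."""
    return [pmcid for pmcid, kinds in written.items() if kind not in kinds]


sample = """\
INFO: Made Supplementary files for PMC7383801
INFO: */Wrote xml for PMC7383801/
INFO: Wrote the pdf file for PMC7383801
INFO: Made Supplementary files for PMC7001462
INFO: */Wrote xml for PMC7001462/
"""
print(missing(parse_log(sample), "pdf"))  # ['PMC7001462']
```

To check a whole run, the log can be saved with -f (--logfile) and the file contents passed to parse_log.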

Tester 4: Talha Hasan

OS: Windows 10

Date: 8th April 2021

Running pygetpapers in commandline

  • This command installs the updated version of pygetpapers: pip install git+git://github.com/petermr/pygetpapers
C:\Users\talha>pip3 install git+git://github.com/petermr/pygetpapers
Collecting git+git://github.com/petermr/pygetpapers
  Cloning git://github.com/petermr/pygetpapers to c:\users\talha\appdata\local\temp\pip-req-build-r70bxipr
  Running command git clone -q git://github.com/petermr/pygetpapers 'C:\Users\talha\AppData\Local\Temp\pip-req-build-r70bxipr'
Requirement already satisfied: requests in c:\users\talha\appdata\roaming\python\python39\site-packages (from pygetpapers==0.0.3.1) (2.25.1)
Requirement already satisfied: pandas_read_xml in c:\users\talha\appdata\roaming\python\python39\site-packages (from pygetpapers==0.0.3.1) (0.0.9)
Requirement already satisfied: pandas in c:\users\talha\appdata\roaming\python\python39\site-packages (from pygetpapers==0.0.3.1) (1.2.3)
Requirement already satisfied: lxml in c:\users\talha\appdata\roaming\python\python39\site-packages (from pygetpapers==0.0.3.1) (4.6.2)
Requirement already satisfied: chromedriver_autoinstaller in c:\users\talha\appdata\roaming\python\python39\site-packages (from pygetpapers==0.0.3.1) (0.2.2)
Requirement already satisfied: xmltodict in c:\users\talha\appdata\roaming\python\python39\site-packages (from pygetpapers==0.0.3.1) (0.12.0)
Requirement already satisfied: selenium in c:\users\talha\appdata\roaming\python\python39\site-packages (from pygetpapers==0.0.3.1) (3.141.0)
Requirement already satisfied: numpy>=1.16.5 in c:\users\talha\appdata\roaming\python\python39\site-packages (from pandas->pygetpapers==0.0.3.1) (1.20.1)
Requirement already satisfied: pytz>=2017.3 in c:\users\talha\appdata\roaming\python\python39\site-packages (from pandas->pygetpapers==0.0.3.1) (2021.1)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\talha\appdata\roaming\python\python39\site-packages (from pandas->pygetpapers==0.0.3.1) (2.8.1)
Requirement already satisfied: six>=1.5 in c:\users\talha\appdata\roaming\python\python39\site-packages (from python-dateutil>=2.7.3->pandas->pygetpapers==0.0.3.1) (1.15.0)
Requirement already satisfied: pyarrow in c:\users\talha\appdata\roaming\python\python39\site-packages (from pandas_read_xml->pygetpapers==0.0.3.1) (3.0.0)
Requirement already satisfied: distlib in c:\users\talha\appdata\roaming\python\python39\site-packages (from pandas_read_xml->pygetpapers==0.0.3.1) (0.3.1)
Requirement already satisfied: zipfile36 in c:\users\talha\appdata\roaming\python\python39\site-packages (from pandas_read_xml->pygetpapers==0.0.3.1) (0.1.3)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\talha\appdata\roaming\python\python39\site-packages (from requests->pygetpapers==0.0.3.1) (2020.12.5)
Requirement already satisfied: idna<3,>=2.5 in c:\users\talha\appdata\roaming\python\python39\site-packages (from requests->pygetpapers==0.0.3.1) (2.10)
Requirement already satisfied: chardet<5,>=3.0.2 in c:\users\talha\appdata\roaming\python\python39\site-packages (from requests->pygetpapers==0.0.3.1) (4.0.0)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\users\talha\appdata\roaming\python\python39\site-packages (from requests->pygetpapers==0.0.3.1) (1.26.4)

Running pygetpapers on the cmd

  • Run pygetpapers --help
Output
C:\Users\talha>pygetpapers --help
usage: pygetpapers [-h] [-v] [-q QUERY] [-o OUTPUT] [-x] [-p] [-s] [--references REFERENCES] [-n] [--citations CITATIONS] [-l LOGLEVEL] [-f LOGFILE] [-k LIMIT]
                   [-r RESTART] [-u UPDATE] [--onlyquery] [-c] [--synonym]

Welcome to Pygetpapers version 0.0.3.1. -h or --help for help

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         output the version number
  -q QUERY, --query QUERY
                        query string transmitted to repository API. Eg. 'Artificial Intelligence' or 'Plant Parts'. To escape special characters within the quotes,
                        use backslash. The query to be quoted in either single or double quotes.
  -o OUTPUT, --output OUTPUT
                        output directory (Default: current working directory)
  -x, --xml             download fulltext XMLs if available
  -p, --pdf             download fulltext PDFs if available
  -s, --supp            download supplementary files if available
  --references REFERENCES
                        Download references if available. Requires source for references (AGR,CBA,CTX,ETH,HIR,MED,PAT,PMC,PPR).
  -n, --noexecute       report how many results match the query, but don't actually download anything
  --citations CITATIONS
                        Download citations if available. Requires source for citations (AGR,CBA,CTX,ETH,HIR,MED,PAT,PMC,PPR).
  -l LOGLEVEL, --loglevel LOGLEVEL
                        Provide logging level. Example --log warning <<info,warning,debug,error,critical>>, default='info'
  -f LOGFILE, --logfile LOGFILE
                        save log to specified file in output directory as well as printing to terminal
  -k LIMIT, --limit LIMIT
                        maximum number of hits (default: 100)
  -r RESTART, --restart RESTART
                        Reads the json and makes the xml files. Takes the path to the json as the input
  -u UPDATE, --update UPDATE
                        Updates the corpus by downloading new papers. Takes the path of metadata json file of the orignal corpus as the input. Requires -k or --limit
                        (If not provided, default will be used) and -q or --query (must be provided) to be given. Takes the path to the json as the input.
  --onlyquery           Saves json file containing the result of the query in storage. The json file can be given to --restart to download the papers later.
  -c, --makecsv         Stores the per-document metadata as csv. Works only with --api method.
  --synonym             Results contain synonyms as well.

Test Performed

  • Example query: pygetpapers -q "Medicinal Activity" -k 10 -o "output" -x -p -c -s
  • In this command: -x (--xml) downloads fulltext XMLs if available, -p (--pdf) downloads fulltext PDFs if available, -s (--supp) downloads supplementary files if available, and -c (--makecsv) stores the per-document metadata as csv (works only with the --api method).
  • The command created an "output" folder in the current directory. Inside it, each downloaded paper (limited in number by -k) has its own folder named after its PMC ID (e.g. PMC7751408).
  • Each PMC ID folder contains the eupmc_result JSON file, the fulltext csv, pdf and xml, and the supplementary files.
  • In addition, a .csv file with the PMC ID, HTML link, keywords, pdf link, journal title and author info was created.
  • Example query 2: pygetpapers -q "Medicinal Activity" -k 10 -o "out_test" -x -p -c -s -l "info"
    • In this command, -l (--loglevel) sets the logging level: info, warning, debug, error or critical.
C:\Users\talha>pygetpapers -q "Medicinal Activity" -k 10 -o "out_test" -x -p -c -s -l "info"
INFO: Total Hits are 206841
INFO: Saving XML files to C:\Users\DELL\out_test\*\fulltext.xml
INFO: Made Supplementary files for PMC7822064
INFO: */Wrote xml for PMC7822064/
INFO: Wrote the pdf file for PMC7822064
INFO: Made Supplementary files for PMC7993383
INFO: */Wrote xml for PMC7993383/
INFO: Wrote the pdf file for PMC7993383
INFO: Made Supplementary files for PMC7939573
INFO: */Wrote xml for PMC7939573/
INFO: Wrote the pdf file for PMC7939573
INFO: Made Supplementary files for PMC7833026
INFO: */Wrote xml for PMC7833026/
INFO: Wrote the pdf file for PMC7833026
INFO: Made Supplementary files for PMC7808749
INFO: */Wrote xml for PMC7808749/
INFO: Wrote the pdf file for PMC7808749
INFO: Made Supplementary files for PMC7751408
INFO: */Wrote xml for PMC7751408/
INFO: Wrote the pdf file for PMC7751408
INFO: Made Supplementary files for PMC7889190
INFO: */Wrote xml for PMC7889190/
INFO: Wrote the pdf file for PMC7889190
INFO: Made Supplementary files for PMC7850424
INFO: */Wrote xml for PMC7850424/
INFO: Wrote the pdf file for PMC7850424
INFO: Made Supplementary files for PMC7782983
INFO: */Wrote xml for PMC7782983/
INFO: Wrote the pdf file for PMC7782983
INFO: Made Supplementary files for PMC7782162
INFO: */Wrote xml for PMC7782162/
INFO: Wrote the pdf file for PMC7782162