Skip to content

Script to retrieve metadata for the FTP data from the baseline of PubMed

Notifications You must be signed in to change notification settings

SergiAguilo/retrieveMetadataPubMed

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 

Repository files navigation

retrieveMetadataPubMed

Script to retrieve metadata for the FTP data from the baseline of PubMed

The metadata retrieved is the following: Title, abstract, year, pubmed id, doi, pmcid, ISSN, ISO abbreviation and the label and Id of the Mesh Tags.

As the script is created to parse large amounts of data (~ 40G is the full PubMed Metadata in 2022), it is stored in a SQLite database.

Python 3.5 or later is needed. The script depends on standard libraries, plus the ones declared in requirements.txt.

  • In order to install the dependencies you need pip and venv Python modules.

    • pip is available in many Linux distributions (Ubuntu package python-pip, CentOS EPEL package python-pip), and also as pip Python package.
    • venv is also available in many Linux distributions (Ubuntu package python3-venv). In some of these distributions venv is integrated into the Python 3.5 (or later) installation.
  • The creation of a virtual environment and installation of the dependencies in that environment is done running:

python3 -m venv .pyXMLenv
source .pyXMLenv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

Run the following command to use the script:


usage: parseXMLPubMed.py [-h] -d DATABASE -i INPUT

This program extracts sections and metadata of XML article from EuropePMC

options:
 -h, --help            show this help message and exit
 -d DATABASE, --database DATABASE
                       Required. Database Name where the data will be stored. Not possible to update the database. It should have '.db' sufix
 -i INPUT, --input INPUT
                       Required. Path of the folder where the metadata files (pubmed**n****.xml.gz files) are located

Example:

python3 parseXMLPubMed.py -i exampleData -d exampleDatabase.db

To look at the data I suggest to download the DB Browser for SQLite.

The data is gathered from ftp.ncbi.nlm.nih.gov/pubmed/baseline/.

About

Script to retrieve metadata for the FTP data from the baseline of PubMed

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages