PHI-base pipeline

Python package and command-line application for cleaning and releasing data from version 5 of the Pathogen-Host Interactions Database (PHI-base). Currently supported release formats include:

  • a JSON format that combines data from version 4 and version 5 of PHI-base, and

  • several tabular export formats that are intended for loading by the Ensembl databases.

⚠️ Note that this package is a work in progress. Features still to be added include a tabular release format that contains all data from version 4 and version 5 of PHI-base, and support for querying UniProtKB as part of the release process.

Installation

Install the latest release from GitHub:

python -m pip install 'phibase_pipeline@git+https://github.com/PHI-base/[email protected]'

Or install the latest commit on the main branch:

python -m pip install 'phibase_pipeline@git+https://github.com/PHI-base/phibase-pipeline.git@main'

Usage

JSON release format

To generate the JSON release file, which combines cleaned and validated data from PHI-base version 4 with the version 5 data curated in PHI-Canto, use the following command:

python -m phibase_pipeline zenodo PHIBASE_CSV CANTO_JSON OUTFILE

Explanation of arguments:

  • PHIBASE_CSV: the path to an export of data from PHI-base version 4, stored in a CSV file. These files can be downloaded from the PHI-base/data repository.

  • CANTO_JSON: the path to an export of approved curation sessions from the PHI-Canto curation tool, stored in a JSON file.

  • OUTFILE: the destination path for the combined JSON file produced by the pipeline.
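For scripted releases, the same command can be driven from Python. The following is a minimal sketch rather than part of the package: the helper names `build_zenodo_command` and `run_zenodo` are hypothetical, and only the `zenodo` subcommand and its three positional arguments come from the usage above.

```python
import subprocess
import sys
from pathlib import Path


def build_zenodo_command(phibase_csv, canto_json, outfile):
    """Return the argument list for the documented `zenodo` subcommand.

    Hypothetical helper: it only mirrors the command line shown above.
    """
    return [
        sys.executable, "-m", "phibase_pipeline", "zenodo",
        str(Path(phibase_csv)), str(Path(canto_json)), str(Path(outfile)),
    ]


def run_zenodo(phibase_csv, canto_json, outfile):
    """Run the pipeline; raises CalledProcessError on a non-zero exit code."""
    command = build_zenodo_command(phibase_csv, canto_json, outfile)
    subprocess.run(command, check=True)
```

Using `sys.executable` keeps the call inside the Python environment where the package is installed. A call such as `run_zenodo("phibase4.csv", "canto.json", "release.json")` would then run the pipeline; these file names are placeholders.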

Ensembl release format

To generate CSV files that can be loaded into the Ensembl interactions database, run the following command:

python -m phibase_pipeline ensembl PHIBASE_CSV CANTO_JSON UNIPROT_DATA DIR

The command will produce three CSV files in the directory specified by DIR:

  • phibase4_interactions_export.csv: an export of interactions from PHI-base version 4.

  • phibase5_interactions_export.csv: an export of interactions from curation sessions in the PHI-Canto curation tool.

  • phibase_amr_export.csv: an export of interactions between pathogen genes and antimicrobial chemicals. Currently these interactions are sourced from curation done with the PHI-Canto curation tool.

Explanation of arguments:

  • PHIBASE_CSV: the path to an export of data from PHI-base version 4, stored in a CSV file. These files can be downloaded from the PHI-base/data repository.

  • CANTO_JSON: the path to an export of approved curation sessions from the PHI-Canto curation tool, stored in a JSON file.

  • UNIPROT_DATA: the path to a file containing data about genes and proteins retrieved from the UniProt Knowledgebase (UniProtKB). This file is created by downloading the results of a query to the UniProtKB ID mapping service in TSV format (see the UniProt data file format section).

  • DIR: the destination directory for the CSV files created by the pipeline.
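After a run, it can be useful to confirm that all three documented files were written to DIR. The sketch below assumes nothing about the pipeline beyond the file names listed above; the function name `missing_ensembl_outputs` is hypothetical.

```python
from pathlib import Path

# File names taken from the outputs documented for the `ensembl` command.
ENSEMBL_OUTPUTS = (
    "phibase4_interactions_export.csv",
    "phibase5_interactions_export.csv",
    "phibase_amr_export.csv",
)


def missing_ensembl_outputs(out_dir):
    """Return the documented output files that are absent from out_dir."""
    out_dir = Path(out_dir)
    return [name for name in ENSEMBL_OUTPUTS if not (out_dir / name).is_file()]
```

An empty list means every expected file is present.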

UniProt data file format

The file passed to the UNIPROT_DATA command-line argument must contain the following columns, in the following order:

  • From: the UniProtKB accession number used in the ID mapping query.
  • Entry: the UniProtKB accession number for the protein.
  • Organism: the scientific name of the organism to which the protein belongs.
  • Organism (ID): the NCBI Taxonomy ID of the organism.
  • Taxonomic lineage (Ids): the taxonomic lineage of the organism, containing ID numbers and ranks.
  • Ensembl: a gene ID from the Ensembl Genomes database.
  • EnsemblBacteria: a gene ID from the Ensembl Bacteria database.
  • EnsemblFungi: a gene ID from the Ensembl Fungi database.
  • EnsemblMetazoa: a gene ID from the Ensembl Metazoa database.
  • EnsemblPlants: a gene ID from the Ensembl Plants database.
  • EnsemblProtists: a gene ID from the Ensembl Protists database.
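Because the column names and their order both matter, a quick header check before running the pipeline can catch a misconfigured download. This is a minimal sketch and not the package's own validation; the function name `check_uniprot_header` is hypothetical, and the file is assumed to be tab-separated and UTF-8 encoded.

```python
import csv

# Column names and order as listed above.
EXPECTED_COLUMNS = [
    "From", "Entry", "Organism", "Organism (ID)", "Taxonomic lineage (Ids)",
    "Ensembl", "EnsemblBacteria", "EnsemblFungi", "EnsemblMetazoa",
    "EnsemblPlants", "EnsemblProtists",
]


def check_uniprot_header(path):
    """Raise ValueError if the TSV header differs from EXPECTED_COLUMNS."""
    with open(path, newline="", encoding="utf-8") as handle:
        # An empty file yields an empty header, which also fails the check.
        header = next(csv.reader(handle, delimiter="\t"), [])
    if header != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {header!r}")
```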

To generate a valid file, use the UniProtKB ID mapping service to query one or more UniProtKB accession numbers. The ‘From database’ should be ‘UniProtKB AC/ID’ and the ‘To database’ should be ‘UniProtKB’ (this is the default setting).

Then use the Download link on the results page, and use the ‘Customize columns’ field to set the following columns in the following order:

  • Organism
  • Organism (ID)
  • Taxonomic lineage (Ids)
  • Ensembl
  • EnsemblBacteria
  • EnsemblFungi
  • EnsemblMetazoa
  • EnsemblPlants
  • EnsemblProtists

Alternatively, use the following URL, replacing the {id} placeholder with the ID of your ID mapping job.

https://rest.uniprot.org/idmapping/uniprotkb/results/stream/{id}?fields=accession%2Corganism_name%2Corganism_id%2Clineage_ids%2Cxref_ensembl%2Cxref_ensemblbacteria%2Cxref_ensemblfungi%2Cxref_ensemblmetazoa%2Cxref_ensemblplants%2Cxref_ensemblprotists&format=tsv
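The URL above can also be assembled programmatically, which avoids hand-editing the percent-encoded field list. The sketch below reproduces the same query string with the standard library; the function name `build_stream_url` is hypothetical, and the field names are copied from the URL.

```python
from urllib.parse import urlencode

BASE_URL = "https://rest.uniprot.org/idmapping/uniprotkb/results/stream/"

# Field names copied from the URL above, in the same order.
FIELDS = [
    "accession", "organism_name", "organism_id", "lineage_ids",
    "xref_ensembl", "xref_ensemblbacteria", "xref_ensemblfungi",
    "xref_ensemblmetazoa", "xref_ensemblplants", "xref_ensemblprotists",
]


def build_stream_url(job_id):
    """Return the TSV download URL for a given ID mapping job ID."""
    # urlencode percent-encodes the commas between field names as %2C.
    query = urlencode({"fields": ",".join(FIELDS), "format": "tsv"})
    return f"{BASE_URL}{job_id}?{query}"
```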

Note that the stream endpoint returns chunks of 500 results at a time, and requires pagination if more than 500 results are expected. See the UniProt REST API documentation for more instructions.
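Where pagination is needed, one possible approach is to follow the link to the next page until none remains. The sketch below assumes that each paginated response carries a `Link` header with a `rel="next"` entry, as in UniProt's published API examples; check the UniProt documentation to confirm this for the endpoint in use. The names `fetch` and `iter_pages` are hypothetical.

```python
import re
from urllib.request import urlopen

# Matches the target of a `Link: <...>; rel="next"` response header.
NEXT_LINK = re.compile(r'<([^>]+)>;\s*rel="next"')


def fetch(url):
    """Download one page; return its text and the Link header ('' if absent)."""
    with urlopen(url) as response:
        body = response.read().decode("utf-8")
        return body, response.headers.get("Link", "")


def iter_pages(url, fetch_page=fetch):
    """Yield the body of each page, following rel="next" links in turn."""
    while url:
        body, link_header = fetch_page(url)
        yield body
        match = NEXT_LINK.search(link_header or "")
        # Stop when the response no longer advertises a next page.
        url = match.group(1) if match else None
```

Passing the download function as a parameter keeps the loop testable without network access.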

License

The phibase_pipeline package is distributed under the terms of the MIT license.
