Skip to content

Detailed explanation on how to setup database for NCBI IgBlast executable, and using it in Python

License

Notifications You must be signed in to change notification settings

xinyu-dev/igblast

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Using IgBlast with Python

Since IgBlast documentation does not have great details on the database setup, and there are a lot of questions on the forum regarding to the issue internal_data could not be found or the Germline annotation database...could not be found in [internal_data] directory, I wrote this short guide to in case it might be helpful.

This first part of this guide applies to installation of IgBlast and IMGT human germline sequence database. The second part applies to calling the IgBlast executable inside Jupyter notebook and converting some of the results to Pandas dataframe.

This repository contains executables and data files from ncbi-igblast-1.17.0-x64-macosx.tar.gz (latest Mac version as of Jan 9, 2021). It also has the IGHV, IGKV, and IGLV germline protein database obtained from IMGT (latest as of Jan 9, 2021).

Download IgBlast

Download executable here: ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/LATEST

Mac: Download the ncbi-igblast-1.17.0-x64-macosx.tar.gz or the latest version(NOT the dmg file) and unzip it in your destination folder.

Then:

  1. Create jupyter notebook inside the igblast folder.
  2. Create an empty folder called database inside the igblast folder

It should have a structure similar to what's shown below:

----igblast
  --bin
  --internal_data
  --optional_file
  --ChangeLog
  --LICENSE
  --ncbi_pakacage_info
  --README
  --database
  --Using IgBlast.ipynb

Download IMGT germline sequence

  1. Germline seuqence can be found here. Both nucleic acid and protein sequences are available. If you

Note that the for blastp (protein blast), only V region alignment is provided by IgBlast.

For example, if you want to download the amino acid germline sequence for human IgHV, you can move to this section on the page, and click on Human under F+ORF+in-frame P. Note that that you can also select Human under F+ORF+in-frame P with IMGT gaps (sequence with gaps), but after cleaning the results will be the same.

pic

  1. Once you click on the species in the page above, you will be directed to a page with a list of fasta sequences. Copy the fasta sequences.

For example, if we want to set up a database for human IGKV germline sequence, we would copy all fasta sequences by locating IGKV row and column F+ORF+in-frame P in table above, then click on the Human link. As of now, there is 108 sequences (example of the first 2 are shown below), and you need to copy all of them.

>V01577|IGKV1-12*01|Homo sapiens|F|V-REGION|1360..1646|287 nt|1| | | |95 AA|95+0=95| | |
DIQMTQSPSSVSASVGDRVTITCRASQGISSWLAWYQQKPGKAPKLLIYAASSLQSGVPS
RFSGSGSGTDFTLTISSLQPEDFATYYCQQANSFP
>V01576|IGKV1-12*02|Homo sapiens|F|V-REGION|1361..1647|287 nt|1| | | |95 AA|95+0=95| | |
DIQMTQSPSSVSASVGDRVTITCRASQGISSWLAWYQQKPGKAPKLLIYAASSLQSGVPS
......
......
  1. Inside the database folder you created in the 1st step, create another folder called Homo_sapiens and Homo_sapiens_clean Inside each of these 2 folders, create a folder called IG_dna and a folder called IG_prot. See screenshot below:

These folders are empty for now.

  1. Inside the IG_prot folder under Homo_sapiens folder, create a text file and paste the fasta sequences of IGKV in there.

Do NOT keep the .txt or .fasta extension for the text file. (otherwise Igblast automatically adds other extensions after your current extension, and your database name will look weird).

Here is what the directory structure looks like. Note that the text files do not have any extension (.txt or .fasta)

Here is a screenshot of what the IGKV text file looks like:

  1. If you have any DNA germline sequence, you can use similar method and copy them to IG_dna folder

Cleaning the IMGT germline sequence files

You can execute the command through terminal, or using the python script below.

Method 1: Use terminal

Open a terminal at the igblast folder. Then type

bin/edit_imgt_file.pl database/Homo_sapiens/IG_prot/IGKV > database/Homo_sapiens_clean/IG_prot/IGHV_clean

The above line cleans up the IGKV text file you just made, and save the cleaned file inside Homo_sapiens_clean/IG_prot folder.

After cleaning, the id, name, and description of the original IMGT fasta sequences are truncated to keep only the allele type. An example of cleaned sequence:

Method 2: Use python

# cleaning up IGKV IMGT germline sequence file
input_imgt_ref = 'database/Homo_sapiens/IG_prot/IGKV'
output_imgt_ref = 'database/Homo_sapiens_clean/IG_prot/IGKV_clean'

cmd = ['bin/edit_imgt_file.pl', input_imgt_ref , '>', output_imgt_ref]

# display result (and any error) in notebook. No file will be saved
# subprocess.run(cmd.split(), capture_output=True)

# save cmd output to file
fh = open(output_imgt_ref, 'w')
subprocess.Popen(cmd, stdout=fh)

Make Database

We need to parse the cleaned germline sequence file to database. There are 2 ways to do it. In terminal or in Python.

The basic syntax for DNA and protein seuqence is:

bin/makeblastdb -parse_seqids -dbtype nucl -in my_seq_file
bin/makeblastdb -parse_seqids -dbtype prot -in my_seq_file

Method 1: Use terminal

To parse the IGKV protein germline seuqences, open a terminal at the igblast folder. Then type

bin/makeblastdb -parse_seqids -dbtype prot -in database/Homo_sapiens_clean/IG_prot/IGKV_clean

Igblast will create a variety of files in the same directory as the cleaned germline sequence text file. A screenshot is shown below:

Method 2: Use Python

cmd = ['bin/makeblastdb', '-parse_seqids', '-dbtype', 'prot', '-in', output_imgt_ref]

subprocess.Popen(cmd, stdout=subprocess.PIPE)

Repeat the process for other germline sequence database. Create one database for VH, one for VL, one for VK...

Parse your sequence

Details of different parameters and options can be found using:

bin/iglbastp -help
bin/igblastn -help

Below is an example output of bin/igblastp -help

USAGE
  igblastp [-h] [-help] [-germline_db_V germline_database_name]
    [-num_alignments_V int_value] [-germline_db_V_seqidlist filename]
    [-organism germline_origin] [-domain_system domain_system]
    [-ig_seqtype sequence_type] [-focus_on_V_segment] [-extend_align5end]
    [-extend_align3end] [-min_V_length Min_V_Length] [-db database_name]
    [-dbsize num_letters] [-entrez_query entrez_query] [-query input_file]
    [-out output_file] [-evalue evalue] [-word_size int_value]
    [-gapopen open_penalty] [-gapextend extend_penalty] [-searchsp int_value]
    [-sum_stats bool_value] [-matrix matrix_name] [-threshold float_value]
    [-ungapped] [-lcase_masking] [-query_loc range] [-parse_deflines]
    [-outfmt format] [-show_gis] [-num_descriptions int_value]
    [-num_alignments int_value] [-line_length line_length]
    [-num_threads int_value] [-remote] [-version]

DESCRIPTION
   BLAST for Ig and TCR sequences

OPTIONAL ARGUMENTS
 -h
   Print USAGE and DESCRIPTION;  ignore all other parameters
 -help
   Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters
 -version
   Print version number;  ignore other arguments

 *** Input query options
 -query <File_In>
   Input file name
   Default = `-'
 -query_loc <String>
   Location on the query sequence in 1-based offsets (Format: start-stop)

 *** General search options
 -db <String>
   Optional additional database name
 -out <File_Out>
   Output file name
   Default = `-'
 -evalue <Real>
   Expectation value (E) threshold for saving hits
   Default = `1'
 -word_size <Integer, >=2>
   Word size for wordfinder algorithm
 -gapopen <Integer>
   Cost to open a gap
 -gapextend <Integer>
   Cost to extend a gap
 -matrix <String>
   Scoring matrix name (normally BLOSUM62)
 -threshold <Real, >=0>
   Minimum word score such that the word is added to the BLAST lookup table

 *** Formatting options
 -outfmt <String>
   alignment view options:
     3 = Flat query-anchored, show identities,
     4 = Flat query-anchored, no identities,
     7 = Tabular with comment lines
     19 = Rearrangement summary report (AIRR format)

   Options 7 can be additionally configured to produce
   a custom format specified by space delimited format specifiers.
   The supported format specifiers are:
            qseqid means Query Seq-id
               qgi means Query GI
              qacc means Query accesion
           qaccver means Query accesion.version
              qlen means Query sequence length
            sseqid means Subject Seq-id
         sallseqid means All subject Seq-id(s), separated by a ';'
               sgi means Subject GI
            sallgi means All subject GIs
              sacc means Subject accession
           saccver means Subject accession.version
           sallacc means All subject accessions
              slen means Subject sequence length
            qstart means Start of alignment in query
              qend means End of alignment in query
            sstart means Start of alignment in subject
              send means End of alignment in subject
              qseq means Aligned part of query sequence
              sseq means Aligned part of subject sequence
            evalue means Expect value
          bitscore means Bit score
             score means Raw score
            length means Alignment length
            pident means Percentage of identical matches
            nident means Number of identical matches
          mismatch means Number of mismatches
          positive means Number of positive-scoring matches
           gapopen means Number of gap openings
              gaps means Total number of gaps
              ppos means Percentage of positive-scoring matches
            frames means Query and subject frames separated by a '/'
            qframe means Query frame
            sframe means Subject frame
              btop means Blast traceback operations (BTOP)
            staxid means Subject Taxonomy ID
          ssciname means Subject Scientific Name
          scomname means Subject Common Name
        sblastname means Subject Blast Name
         sskingdom means Subject Super Kingdom
           staxids means unique Subject Taxonomy ID(s), separated by a ';'
                         (in numerical order)
         sscinames means unique Subject Scientific Name(s), separated by a ';'
         scomnames means unique Subject Common Name(s), separated by a ';'
        sblastnames means unique Subject Blast Name(s), separated by a ';'
                         (in alphabetical order)
        sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
                         (in alphabetical order)
            stitle means Subject Title
        salltitles means All Subject Title(s), separated by a '<>'
           sstrand means Subject Strand
             qcovs means Query Coverage Per Subject
           qcovhsp means Query Coverage Per HSP
            qcovus means Query Coverage Per Unique Subject (blastn only)
   When not provided, the default value is:
   'qseqid sseqid pident length mismatch gapopen gaps qstart qend sstart send
   evalue bitscore', which is equivalent to the keyword 'std'
   Default = `3'
 -show_gis
   Show NCBI GIs in deflines?
 -num_descriptions <Integer, >=0>
   Number of database sequences to show one-line descriptions for
   Not applicable for outfmt > 4
   Default = `10'
 -num_alignments <Integer, >=0>
   Number of database sequences to show alignments for
   Default = `10'
 -line_length <Integer, >=1>
   Line length for formatting alignments
   Not applicable for outfmt > 4
   Default = `60'

 *** Query filtering options
 -lcase_masking
   Use lower case filtering in query and subject sequence(s)?

 *** Restrict search or results
 -entrez_query <String>
   Restrict search with the given Entrez query
    * Requires:  remote

 *** Statistical options
 -dbsize <Int8>
   Effective length of the database
 -searchsp <Int8, >=0>
   Effective length of the search space
 -sum_stats <Boolean>
   Use sum statistics

 *** Extension options
 -ungapped
   Perform ungapped alignment only?

 *** Ig-BLAST options
 -germline_db_V <String>
   Germline database name
 -num_alignments_V <Integer>
   Number of Germline sequences to show alignments for
   Default = `3'
 -germline_db_V_seqidlist <String>
   Restrict search of germline database to list of SeqIds's
 -organism <String>
   The organism for your query sequence. Supported organisms include human,
   mouse, rat, rabbit and rhesus_monkey for Ig and human and mouse for TCR.
   Custom organism is also supported but you need to supply your own germline
   annotations (see IgBLAST web site for details)
   Default = `human'
 -domain_system <String, `imgt', `kabat'>
   Domain system to be used for segment annotation
   Default = `imgt'
 -ig_seqtype <String, `Ig', `TCR'>
   Specify Ig or T cell receptor sequence
   Default = `Ig'
 -focus_on_V_segment
   Should the search only be for V segment (effective only for non-germline
   database search using -db option)?
 -extend_align5end
   Extend V gene alignment at 5' end
 -extend_align3end
   Extend J gene alignment at 3' end
 -min_V_length <Integer, >=9>
   Minimal required V gene length
   Default = `9'

 *** Miscellaneous options
 -parse_deflines
   Should the query and subject defline(s) be parsed?
 -num_threads <Integer, >=1>
   Number of threads (CPUs) to use in the BLAST search
   Default = `4'
    * Incompatible with:  remote
 -remote
   Execute search remotely?
    * Incompatible with:  num_threads

Assign path to environment variable IGDATA

This will help to solve the issue of [internal_data] cannot be found. THe IGDATA should:

  1. point to the bin folder, not the internal_data folder.
  2. use the full path (e.g. in mac /Users/mightycamole/Desktop/DataScience/Python/Work/igblast/bin) and NOT a relative path
  3. The path should not contain any spaces (change your directory name if it does). You might as well avoid _ and - in any of the folder names as well.
  4. Use the export option to assign the environment variable. If you use conda, don't create environment variable inside a specific conda enviroment or modify the envs state file. Directly use the export option described below.

Method: Open a terminal at the igblast folder. Then type

export IGDATA=/Users/mightycamole/Desktop/DataScience/Python/Work/igblast/bin

Replace the file path with your file path

Parse your sequence into Pandas Dataframe

Suppose you have a folder called test inside igblast. The folder contains Rituximab VH and VL sequence in fasta format. Below are 3 examples of parsing:

  1. In Example 1, we parse query seuqence in terminal.
  2. In Example 2, the code will parse the closest germline matching hits into pandas dataframe.

Example:

df, top_germ_allele, top_identity = blastp_get_top_hits_v(
    input_fp = 'scratch/Rituximab-VH.fasta',
    db_fp= 'database/Homo_sapiens_clean/IG_prot/IGHV_clean'
)
print('best match germline: ', top_germ_allele)
print('identity %: ', top_identity)
df

Output:

  1. In Example 3, the code will try to determine if a light chain sequence is a kappa or lambda type.

Example:

df, top_germ_allele, top_identity = blastp_multiple_hits_v(
    input_fp = 'scratch/Rituximab-VL.fasta',
    db_lst= ['database/Homo_sapiens_clean/IG_prot/IGKV_clean','database/Homo_sapiens_clean/IG_prot/IGLV_clean']
)
print('best match germline: ', top_germ_allele)
print('identity %: ', top_identity)
df

Output:

See jupyter notebook for details on the examples

About

Detailed explanation on how to setup database for NCBI IgBlast executable, and using it in Python

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published