Genetic taxonomy of influenza A virus subtypes

Scripts and output files associated with the manuscript "Prospects for a sequence-based taxonomy of influenza A virus subtypes":
https://www.biorxiv.org/content/10.1101/2023.07.06.548035v2

Input data files can be obtained from Zenodo under a Creative Commons license:
DOI

Scripts

  • chainsaw-plot.R - R script to visualize the number of subtrees produced by the edgewise clustering ("chainsaw") method as a function of the internal branch length cutoff.
    This script is applied to results obtained for protein sequence phylogenies reconstructed for all eight influenza A virus (IAV) genome segments, with an emphasis on hemagglutinin (HA) and neuraminidase (NA) proteins.

  • coldates.R - a simple R script that was used to generate Figure S1 (a barplot of the number of IAV sequences deposited to the GISAID database per year).

  • chainsaw.py - Python script implementing the edgewise clustering method. Requires Biopython. Running the script without any arguments prints a histogram summary of branch lengths to the console. Specifying a branch length cutoff with --cutoff prints a summary of the resulting subtrees (the default output format, -f summary). Setting the -f option to labels instead writes a detailed CSV output listing subtree assignments for all tips. The script also calculates the normalized mutual information between the subtree partition and subtype labels.

  • compress-seqs.py - This Python script looks for exact matches among the unaligned sequences of the input FASTA file, and writes each unique sequence to an output FASTA file under the first label encountered. The labels of all duplicates are written to a CSV file that links them to that first label. This script also filters out sequences with an excessive number of ambiguous amino acids (X).

  • concat-genes.py - This Python script concatenates the non-overlapping amino acid sequences for M1/M2 or NS1/NS2 records from the same isolate. The input is assumed to be a CDS FASTA file generated by the NCBI Genbank interface.

  • filter-prot.py - This Python script applies an initial filter on the CDS FASTA files downloaded from Genbank. It uses regular expressions to remove records that do not correspond to the query protein.

  • get-metadata.py - The default sequence names for Genbank CDS downloads are not very informative, so this script is used to retrieve more useful metadata such as the strain name and collection date from the database based on the accession number. It takes either a FASTA or NWK file as input. The results are written to a CSV file.

  • midpoint.R - This small R script simply calls the midpoint rooting function of the phangorn package on the input tree.

  • plot-trees.R - This R script requires the R package ggfree. It generates plots of the large HA and NA phylogenies, colouring branches based on subtype labels on the tips.

  • relabel-fasta.py - This Python script uses the CSV generated by get-metadata.py to replace the sequence names in the user-specified FASTA input file.

  • subtree-grid.R - This R script was used to generate the supplementary figure summarizing the results of nodewise clustering of the HA phylogeny.

  • subtyping.py - This Python script implements the nodewise clustering method, calculating a number of summary statistics for every internal node of the input tree.
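
To illustrate the idea behind edgewise clustering and the normalized mutual information described for chainsaw.py, here is a minimal sketch in plain Python. It is not the code from this repository: the toy tree, the child-to-parent dictionary format, the function names, and the choice to normalize mutual information by the mean of the two entropies are all assumptions made for illustration. The sketch assumes that only internal branches longer than the cutoff are cut, and that tips left connected afterwards form one subtree.

```python
from collections import Counter
from math import log


def edgewise_clusters(edges, cutoff):
    """Assign tips to subtrees after cutting long internal branches.

    edges maps each child node to (parent, branch_length). This toy
    representation is an assumption, not the input format of chainsaw.py.
    """
    parents = {parent for parent, _ in edges.values()}
    nodes = set(edges) | parents
    tips = sorted(node for node in nodes if node not in parents)
    link = {node: node for node in nodes}  # union-find forest

    def find(node):
        while link[node] != node:
            link[node] = link[link[node]]  # path halving
            node = link[node]
        return node

    for child, (parent, length) in edges.items():
        if child in parents and length > cutoff:
            continue  # internal branch above the cutoff: cut it
        link[find(child)] = find(parent)  # otherwise keep both ends together

    ids = {}
    return {tip: ids.setdefault(find(tip), len(ids)) for tip in tips}


def normalized_mutual_info(xs, ys):
    """NMI of two labelings, normalized by the mean entropy (an assumption)."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    h_x = -sum(c / n * log(c / n) for c in px.values())
    h_y = -sum(c / n * log(c / n) for c in py.values())
    if h_x + h_y == 0:
        return 1.0  # both labelings consist of a single group
    mi = sum(c / n * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())
    return 2 * mi / (h_x + h_y)


# Hypothetical five-tip tree with made-up branch lengths and subtype labels.
tree = {
    "A": ("root", 0.30), "B": ("root", 0.25), "C": ("B", 0.02),
    "t1": ("A", 0.05), "t2": ("A", 0.04),
    "t3": ("C", 0.03), "t4": ("C", 0.50), "t5": ("B", 0.06),
}
subtypes = {"t1": "H1", "t2": "H1", "t3": "H2", "t4": "H2", "t5": "H3"}

for cutoff in (0.01, 0.18, 1.0):
    clusters = edgewise_clusters(tree, cutoff)
    tips = sorted(clusters)
    nmi = normalized_mutual_info(
        [clusters[t] for t in tips], [subtypes[t] for t in tips]
    )
    print(cutoff, len(set(clusters.values())), round(nmi, 3))
```

Lowering the cutoff cuts more internal branches and so yields more subtrees. Note that the long terminal branch leading to t4 is never cut in this sketch, because only internal branches are candidates.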

Results

  • HA.mindiv0_08.maxpat1_2.subtypes.csv - This CSV file was generated for the HA sequence alignment using the subtyping.py script that implements a nodewise clustering method with minimum divergence (mindiv) set to 0.08 and maximum mean patristic distance (maxpat) set to 1.2. These data were used to generate Supplementary Figure 3D.

  • chainsaw-HA-0.18.labels.csv - This CSV file was generated using the chainsaw.py script for the input tree HA.nwk with the options -f labels and --cutoff 0.18. The results were used to generate the matrix plot comparing subtrees to HA subtype labels (Figure 2B).

  • chainsaw-NA-0.41.labels.csv - This CSV file was generated using the chainsaw.py script for the input tree NA.nwk with the options -f labels and --cutoff 0.41. The results were used to generate the matrix plots comparing subtrees to NA subtype labels (Figure 3B).

  • chainsaw-nsubtrees-na.csv - This CSV was generated by running chainsaw.py for the input tree NA.nwk under varying settings of --cutoff, and recording the number of subtrees listed in the summary outputs (Figure 3A).

  • chainsaw-nsubtrees-others.csv - This CSV was generated by running chainsaw.py for all trees except for HA.nwk and NA.nwk under varying settings of --cutoff, and recording the number of subtrees listed in the summary outputs (Supplementary Figure S4).

  • chainsaw-nsubtrees.csv - This CSV was generated by running chainsaw.py for the input tree HA.nwk under varying settings of --cutoff, and recording the number of subtrees listed in the summary outputs (Figure 2A).

  • edge-index.RData - This file stores some intermediate outputs of the plot-trees.R script.

  • subtree-grid.csv - This CSV file was generated by the subtree-grid.py script for the HA phylogeny with no --minlen or --maxlen option specified. It is used by the script subtree-grid.R to generate Supplementary Figure S3 (A to C).

  • treeplots.RData - This file stores some intermediate outputs of the plot-trees.R script.
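
The labels files listed above record a subtree assignment for every tip, and the matrix plots compare those assignments to subtype labels. The following sketch shows one way such a cross-tabulation could be computed with the Python standard library. The column names (tip, subtree, subtype) and the example rows are hypothetical, since the layout of the labels CSV files is not described here, so adjust the keys to match the actual headers.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt: the real column names and values in the
# chainsaw-*.labels.csv files may differ.
example_csv = """tip,subtree,subtype
t1,0,H1
t2,0,H1
t3,1,H2
t4,1,H2
t5,1,H3
"""


def cross_tabulate(handle, row_key="subtree", col_key="subtype"):
    """Count tips for every (subtree, subtype) pair in a labels table."""
    counts = Counter(
        (row[row_key], row[col_key]) for row in csv.DictReader(handle)
    )
    rows = sorted({r for r, _ in counts})
    cols = sorted({c for _, c in counts})
    return rows, cols, counts


rows, cols, counts = cross_tabulate(io.StringIO(example_csv))

# Print a tab-separated matrix with subtrees as rows and subtypes as columns.
print("subtree", *cols, sep="\t")
for r in rows:
    print(r, *(counts[(r, c)] for c in cols), sep="\t")
```

To run this on a real file, replace the io.StringIO object with an open file handle. A subtree whose row has a single non-zero cell corresponds to exactly one subtype, while several non-zero cells in one row indicate that the subtree spans more than one subtype.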