Outputs
PPanGGOLiN provides multiple outputs to describe a pangenome. This section describes each of them.
In most cases PPanGGOLiN will provide an HDF5 file named "pangenome.h5". This file stores all the information about your pangenome and the analyses that were run. Most subcommands can read information from it when it is given to ppanggolin. This is practical, as you can regenerate figures or output files, or rerun parts of the analysis, without redoing everything.
In this section, each part describes a possible output of PPanGGOLiN, along with the command line that generates it from the HDF5 file, which is assumed to be called 'pangenome.h5'.
When using a subcommand that can generate multiple files (like 'write' or 'draw'), you can provide multiple options to write all of the file formats that you desire at once.
A U-shaped plot is a figure presenting the number of families (y axis) per number of organisms (x axis). It is a .html file that can be opened with any browser and with which you can interact, zoom, move around, mouseover to see numbers in more detail, and you can save what you are seeing as a .png image file.
It can be generated using the 'draw' subcommand as such :
ppanggolin draw -p pangenome.h5 --ucurve
A tile plot is a heatmap representing the gene families (y axis) in the organisms (x axis) making up your pangenome. The tiles on the graph will be colored if the gene family is present in an organism and uncolored if absent. The gene families are ordered by partition, and the genomes are ordered by a hierarchical clustering based on their shared gene families (basically two genomes that are close together in terms of gene family composition will be close together on the figure).
This plot is quite helpful for observing potential structures in your pangenome, and can also help you identify possible outliers. You can interact with it: mousing over a tile in the plot will show the gene identifier(s), the gene family and the organism that correspond to the tile.
If you build your pangenome using the 'workflow' subcommand and you have more than 500 organisms, only the 'shell' and the 'persistent' partitions will be drawn, leaving out the 'cloud' as the figure tends to be too heavy for a browser to open it otherwise.
It can be generated using the 'draw' subcommand as such :
ppanggolin draw -p pangenome.h5 --tile_plot
If you do not want the 'cloud' gene families drawn, as they represent a lot of data and can make the figure hard to open in a browser, you can use the following option :
ppanggolin draw -p pangenome.h5 --tile_plot --nocloud
The rarefaction curve is not drawn by default in the 'workflow' subcommand as it requires a lot of computation. It represents the evolution of the number of gene families in each partition as you add more genomes to the pangenome. It has been used a lot in the literature as an indicator of the diversity of your taxonomic group that your dataset is missing. The idea is that if, at some point, adding genomes to your pangenome no longer adds gene families, you may have captured your entire taxonomic group's diversity. On the contrary, if you are still adding a lot of gene families, you are probably still missing many of them.
There are 8 partitions represented. For each of the partitions there are multiple representations of the observed data. You can find the observed means, medians, 1st and 3rd quartiles of the number of gene families per number of genomes used. You can also find the fit of the data to Heaps' law, which is usually used to represent this evolution of the diversity in terms of gene families in each of the partitions.
It can be generated using the 'rarefaction' subcommand, which is dedicated to drawing this graph, as such :
ppanggolin rarefaction -p pangenome.h5
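As an illustration of what a Heaps' law fit means, the sketch below fits the power law F = kappa * N^gamma (F gene families observed for N genomes) by ordinary least squares in log-log space. This is only a minimal sketch on made-up numbers; the function name is hypothetical and the fitting procedure actually used by PPanGGOLiN may differ.

```python
import math

def fit_heaps_law(n_genomes, n_families):
    """Least-squares fit of log(F) = log(kappa) + gamma * log(N)."""
    xs = [math.log(n) for n in n_genomes]
    ys = [math.log(f) for f in n_families]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx
    kappa = math.exp(mean_y - gamma * mean_x)
    return kappa, gamma

# Made-up data that follows F = 1000 * N**0.5 exactly (not real PPanGGOLiN output)
genomes = [1, 4, 9, 16, 25]
families = [1000 * n ** 0.5 for n in genomes]
kappa, gamma = fit_heaps_law(genomes, families)
print(f"kappa={kappa:.1f}, gamma={gamma:.3f}")
```

A gamma close to 0 means the curve flattens quickly as genomes are added, while a larger gamma means new gene families keep appearing.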
A lot of options can be used with this subcommand to tune your rarefaction curves; most of them are the same as with the partition workflow.
The following 3 are related to the rarefaction alone:
- --depth defines the number of samplings for each number of organisms (default 30)
- --min defines the minimal number of organisms in a sample (default 1)
- --max defines the maximal number of organisms in a sample (default 100)
So for example the following command:
ppanggolin rarefaction -p pangenome.h5 --min 5 --max 50 --depth 30
will draw a rarefaction curve with sample sizes between 5 and 50 genomes, and with 30 samples at each point (so 30 samples of 5 genomes, 30 samples of 6 genomes ... up to 50 genomes).
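The sampling plan implied by --min, --max and --depth can be sketched as follows. This is not PPanGGOLiN's own code: the function name, the genome names, the fixed seed and the choice of sampling without replacement are all assumptions made for illustration.

```python
import random

def rarefaction_samples(organisms, min_size, max_size, depth, seed=0):
    """Draw `depth` random subsets of genomes for every sample size."""
    rng = random.Random(seed)
    # A sample cannot contain more genomes than the pangenome has
    max_size = min(max_size, len(organisms))
    samples = {}
    for size in range(min_size, max_size + 1):
        samples[size] = [rng.sample(organisms, size) for _ in range(depth)]
    return samples

# Hypothetical pangenome of 60 genomes
organisms = [f"genome_{i}" for i in range(60)]
samples = rarefaction_samples(organisms, min_size=5, max_size=50, depth=30)
total = sum(len(s) for s in samples.values())
print(len(samples), "sample sizes,", total, "samples in total")
```

With these values there are 46 sample sizes (5 to 50 genomes) and 30 samples for each, which gives an idea of why the computation is heavy.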
For versions 1.2.30 and above, the 'drawspot' command draws specific spots of interest, whose IDs are provided, or all the spots if you wish. It also writes a gexf file describing the gene families and their organization within the spots. This is basically a subgraph of the pangenome, restricted to the spot itself. The command can be used as such:
ppanggolin drawspot -p pangenome.h5 --spots all
will draw an interactive .html figure and a gexf file for all the spots.
If you are interested in only a single spot, you can use its identifier to draw it, as such:
ppanggolin drawspot -p pangenome.h5 --spots spot_34
for spot_34, for example.
The interactive figures can be edited using the sliders and the radio buttons to change various graphical parameters, and the plot itself can then be saved using the save button on the right of the screen, if need be.
The organisms_statistics.tsv file is a tab-separated file describing the content of each of the genomes used for building the pangenome. It can be useful when working with fragmented data such as MAGs, or if you suspect that some of your genomes are chimeric or do not belong to your taxonomic group (as those genomes will be outliers with regard to the numbers in this file). The first lines, starting with a '#', indicate the parameters used when generating the numbers describing each organism, and should be skipped when loading the file into a spreadsheet. They will be skipped automatically if you load this file with R.
This file is made of 15 columns described in the following table
Column | Description |
---|---|
organism | The name of the organism to which the provided genome belongs |
nb_families | Indicates the number of gene families present in that genome |
nb_persistent_families | The number of persistent families present in that genome |
nb_shell_families | The number of shell families present in that genome |
nb_cloud_families | The number of cloud families present in that genome |
nb_exact_core | The number of exact core families present in that genome. This number should be identical in all genomes. |
nb_soft_core | The number of soft core families present in that genome. The threshold used is indicated in the #soft_core line at the beginning of the file, and is 0.95 by default. |
nb_genes | The number of genes in that genome |
nb_persistent_genes | The number of genes whose family is persistent in that genome |
nb_shell_genes | The number of genes whose family is shell in that genome |
nb_cloud_genes | The number of genes whose family is cloud in that genome |
nb_exact_core_genes | The number of genes whose family is exact core in that genome |
nb_soft_core_genes | The number of genes whose family is soft core in that genome |
completeness | An indicator of the proportion of single copy markers in the persistent that are present in the genome. While it is expected to be relatively close to 100 when working with isolates, it may be particularly interesting when working with very fragmented genomes, as it provides a de novo estimation of the completeness based on the expectation that single copy markers within the persistent should be present in nearly all individuals of the studied taxonomic group |
nb_single_copy_markers | This indicates the number of present single copy markers in the genomes. They are computed using the parameter duplication_margin indicated at the beginning of the file. They correspond to all of the persistent gene families that are not present in more than one copy in 5% (or more) of the genomes by default. |
It can be generated using the 'write' subcommand as such :
ppanggolin write -p pangenome.h5 --stats
This command will also generate the 'mean_persistent_duplication.tsv' file.
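If you prefer to load organisms_statistics.tsv in Python rather than R, the '#' lines have to be skipped explicitly. The sketch below uses a made-up excerpt with only four of the 15 columns; the exact wording of the '#' header lines and the completeness cut-off of 90 are assumptions chosen for the example.

```python
import csv
import io

# Hypothetical excerpt: real files have 15 columns and their own '#' header lines
example = (
    "#soft_core=0.95\n"
    "#duplication_margin=0.05\n"
    "organism\tnb_families\tnb_genes\tcompleteness\n"
    "genomeA\t4100\t4300\t99.2\n"
    "genomeB\t2500\t2600\t61.0\n"
)

def read_stats(handle):
    """Return one dict per genome, ignoring the '#' parameter lines."""
    rows = (line for line in handle if not line.startswith("#"))
    return list(csv.DictReader(rows, delimiter="\t"))

stats = read_stats(io.StringIO(example))
# Flag genomes whose completeness looks low (arbitrary cut-off of 90)
low = [row["organism"] for row in stats if float(row["completeness"]) < 90]
print(low)
```

With a real file, `io.StringIO(example)` would be replaced by `open("organisms_statistics.tsv")`.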
The pangenome's graph can be written in multiple data formats, so that it can be manipulated with other software.
The graph can be written to the .gexf and _light.gexf files. The _light.gexf file contains the gene families as nodes, and the edges between gene families describing their relationships. The .gexf file contains the same thing, but also includes more information about each gene and each relation between gene families. We have made two different files representing the same graph because, while the non-light file is exhaustive, it can be very heavy to manipulate and most of the information in it is not of interest to everyone. The _light.gexf file should be the one you use to manipulate the pangenome graph most of the time.
They can be manipulated and visualised with Gephi, which we have tested extensively, or potentially with any other software or library that can read gexf files, such as networkx or gexf-js among others.
Using Gephi, the layout can be tuned. We advise using the Gephi "Force Atlas 2" algorithm to compute the graph layout with "Stronger Gravity: on" and "scaling: 4000", but don't hesitate to tinker with the layout parameters.
In the _light.gexf file, the nodes contain the number of genes belonging to the gene family, the most common gene name (if you provided annotations), the most common product name (if you provided annotations), the partitions it belongs to, its average and median size in nucleotides, and the number of organisms that have this gene family.
The edges contain the number of times they are present in the pangenome.
The non-light .gexf file contains, in addition to this, all the information about the genes belonging to each gene family (their names, their product strings, their sizes) and all the information about the neighborhood relationships of each pair of genes described through the edges.
The light gexf can be generated using the 'write' subcommand as such :
ppanggolin write -p pangenome.h5 --light_gexf
while the gexf file can be generated as such :
ppanggolin write -p pangenome.h5 --gexf
The json file's content corresponds to the .gexf file content, but in json rather than gexf file format. It follows the 'node-link' format as shown in this example in javascript, or as used in the networkx python library, and it should be usable with both D3js and networkx, or any other software or library that supports this format.
The json can be generated using the 'write' subcommand as such :
ppanggolin write -p pangenome.h5 --json
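A node-link document can also be read with the Python standard library alone. The sketch below assumes the usual node-link key names ("nodes", "links", "id", "source", "target") and uses a tiny made-up graph; the attributes actually stored by PPanGGOLiN on nodes and links are not shown here, and in some node-link variants "source" and "target" are indices rather than identifiers.

```python
import json

# Made-up graph in node-link form (key names are the common convention, assumed here)
example = json.dumps({
    "nodes": [{"id": "famA"}, {"id": "famB"}, {"id": "famC"}],
    "links": [
        {"source": "famA", "target": "famB"},
        {"source": "famB", "target": "famC"},
    ],
})

def adjacency_from_node_link(text):
    """Build an undirected adjacency map {family: set of neighbor families}."""
    data = json.loads(text)
    neighbors = {node["id"]: set() for node in data["nodes"]}
    for link in data["links"]:
        neighbors[link["source"]].add(link["target"])
        neighbors[link["target"]].add(link["source"])
    return neighbors

neighbors = adjacency_from_node_link(example)
print(sorted(neighbors["famB"]))
```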
The Rtab file is basically a presence/absence matrix. The columns are the genomes used to build the pangenome, and the lines are the gene families. The identifier of the gene family is the gene identifier chosen as a representative. There is a 1 if the gene family is present in a genome, and a 0 otherwise. It follows exactly the same format as the 'gene_presence_absence.Rtab' file that you get from the pangenome analysis software roary.
It can be generated using the 'write' subcommand as such :
ppanggolin write -p pangenome.h5 --Rtab
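Such a presence/absence matrix is easy to summarise. The sketch below computes, for each gene family, the fraction of genomes that carry it. The example content is made up, and the layout (tab-separated, one header line whose first cell labels the family column) is assumed from the roary Rtab format.

```python
import csv
import io

# Made-up matrix: one header line, then one line per gene family
example = (
    "Gene\tgenomeA\tgenomeB\tgenomeC\n"
    "fam1\t1\t1\t1\n"
    "fam2\t1\t0\t0\n"
    "fam3\t0\t1\t1\n"
)

def family_frequencies(handle):
    """Return {family: fraction of genomes in which the family is present}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    n_genomes = len(header) - 1  # first column holds the family identifier
    return {row[0]: sum(int(v) for v in row[1:]) / n_genomes for row in reader}

freq = family_frequencies(io.StringIO(example))
core = [fam for fam, f in freq.items() if f == 1.0]
print(core)
```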
The csv file follows a format similar to the gene_presence_absence.csv file generated by roary, and works with scoary if you want to do pangenome-wide association studies.
It can be generated using the 'write' subcommand as such :
ppanggolin write -p pangenome.h5 --csv
The 'mean_persistent_duplication.tsv' file is a .tsv file with a single parameter written as a comment at the beginning of the file, which indicates the proportion of genomes in which a gene family must be present more than once to be considered 'duplicated' (and not a single copy marker). This file lists the gene families, their duplication ratio, their mean presence in the pangenome and whether each is considered a 'single copy marker' or not, which is particularly useful when calculating the completeness recorded in the organisms statistics file described previously.
It can be generated using the 'write' subcommand as such :
ppanggolin write -p pangenome.h5 --stats
This command will also generate the 'organisms_statistics.tsv' file.
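The relationship between the duplication ratio, the single copy markers and the completeness can be sketched as below. This is an interpretation of the descriptions above, not PPanGGOLiN's implementation: the function names are hypothetical, the data are made up, and the choice of computing the duplication ratio over the genomes in which the family is present is an assumption.

```python
def single_copy_markers(copies, margin=0.05):
    """copies: {family: {genome: copy number}}, for persistent families only."""
    markers = set()
    for family, per_genome in copies.items():
        present = [n for n in per_genome.values() if n > 0]
        duplicated = sum(1 for n in present if n > 1)
        # A family stays a marker if it is duplicated in fewer than `margin` of genomes
        if present and duplicated / len(present) < margin:
            markers.add(family)
    return markers

def completeness(genome, copies, markers):
    """Percentage of single copy markers found in the genome."""
    found = sum(1 for fam in markers if copies[fam].get(genome, 0) > 0)
    return 100 * found / len(markers)

# Made-up copy numbers for three persistent families in three genomes
copies = {
    "famA": {"g1": 1, "g2": 1, "g3": 1},
    "famB": {"g1": 2, "g2": 1, "g3": 1},  # duplicated in 1 genome out of 3
    "famC": {"g1": 1, "g2": 1, "g3": 0},  # absent from g3
}
markers = single_copy_markers(copies)
print(sorted(markers), completeness("g3", copies, markers))
```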
The partition files are stored in the 'partitions' directory and are named after the partition that they represent (like persistent.txt for the persistent partition). Each of those files contains the list of gene family identifiers belonging to that partition, one family per line, should you need it for your pipelines or during your analysis.
You can generate those files as such :
ppanggolin write -p pangenome.h5 --partitions
The --projection option writes to a 'projection' directory. One file is written in the .tsv format for every single genome in the pangenome. The columns of this file are described in the following table :
Column | Description |
---|---|
gene | the unique identifier of the gene |
contig | the contig that the gene is on |
start | the start position of the gene |
stop | the stop position of the gene |
strand | The strand that the gene is on |
ori | Will be T if the gene name is dnaA |
family | the identifier of the family to which the gene belongs |
nb_copy_in_org | The number of copies of the family in the organism (basically, if 1, the gene has no closely related paralog in that organism) |
partition | the partition to which the gene family of the gene belongs |
persistent_neighbors | The number of neighbors classified as 'persistent' in the pangenome graph |
shell_neighbors | The number of neighbors classified as 'shell' in the pangenome graph |
cloud_neighbors | The number of neighbors classified as 'cloud' in the pangenome graph |
Those files can be generated as such :
ppanggolin write -p pangenome.h5 --projection
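A projection file can be summarised per genome with a few lines of Python. In the sketch below the column names come from the table above, but the rows are made up, and the presence of a header line as well as the exact values used in the 'ori' and 'partition' columns are assumptions.

```python
import csv
import io
from collections import Counter

# Made-up projection content for one genome (header line assumed)
example = (
    "gene\tcontig\tstart\tstop\tstrand\tori\tfamily\tnb_copy_in_org\tpartition\t"
    "persistent_neighbors\tshell_neighbors\tcloud_neighbors\n"
    "g1\tc1\t1\t900\t+\tT\tfamA\t1\tpersistent\t2\t0\t0\n"
    "g2\tc1\t950\t1500\t-\tF\tfamB\t2\tshell\t1\t1\t0\n"
    "g3\tc1\t1600\t2100\t+\tF\tfamB\t2\tshell\t0\t1\t1\n"
)

def read_projection(handle):
    """Return one dict per gene, keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))

genes = read_projection(io.StringIO(example))
# Number of genes of this genome in each partition
per_partition = Counter(g["partition"] for g in genes)
# Genes whose family has more than one copy in this genome
paralogs = [g["gene"] for g in genes if int(g["nb_copy_in_org"]) > 1]
print(dict(per_partition), paralogs)
```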
You can write a list containing the gene family assigned to every single gene of your pangenome, in a file format exactly like the one provided by MMseqs2 through its subcommand 'createtsv'. It is basically a three-column file listing the gene family name in the first column and the gene name in the second. The third column is either empty or contains an "F", which indicates that the gene is potentially a gene fragment and not complete. This is indicated only if the defragmentation pipeline is used.
You can obtain it as such :
ppanggolin write -p pangenome.h5 --families_tsv
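The three-column layout described above can be turned into a family-to-genes mapping as sketched below. The example lines are made up and the function name is hypothetical; the sketch only assumes a tab-separated file without a header line.

```python
import csv
import io
from collections import defaultdict

# Made-up content: family, gene, and an optional "F" flag for fragments
example = (
    "famA\tgene_1\t\n"
    "famA\tgene_2\tF\n"
    "famB\tgene_3\t\n"
)

def read_families(handle):
    """Return ({family: [genes]}, set of genes flagged as fragments)."""
    members = defaultdict(list)
    fragments = set()
    for row in csv.reader(handle, delimiter="\t"):
        family, gene = row[0], row[1]
        members[family].append(gene)
        if len(row) > 2 and row[2] == "F":
            fragments.add(gene)
    return dict(members), fragments

members, fragments = read_families(io.StringIO(example))
print(members, fragments)
```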
You can get all of the genes' nucleotide sequences if you ran your pangenome analysis from .fasta files or from .gbff/.gbk files. You can get them as such :
ppanggolin write -p pangenome.h5 --all_genes
This option is not available anymore from 1.1.98 and on. The command ppanggolin fasta should be used instead (see Fasta).
You can get representative sequences for your gene families, either as nucleotides or as amino acids.
If you want nucleotide sequences :
ppanggolin write -p pangenome.h5 --all_gene_families
If you want amino acid sequences :
ppanggolin write -p pangenome.h5 --all_prot_families
Those options are not available anymore from 1.1.98 and on. The command ppanggolin fasta should be used instead (see Fasta).
The 'plastic_regions' file is a tsv file that lists all of the detected Regions of Genomic Plasticity (RGPs). It requires the RGP detection analysis to have been run, using either the panrgp command or the rgp command.
The file has the following format :
column | description |
---|---|
region | a unique identifier for the region. This is usually built from the contig it is on, with a number after it |
organism | the organism it is in. This is the organism name provided by the user. |
start | the start position of the RGP in the contig |
stop | the stop position of the RGP in the contig |
genes | the number of genes included in the RGP |
contigBorder | this is a boolean column. If the RGP is on a contig border it will be True, otherwise it will be False. An RGP on a contig border is probably not complete. |
wholeContig | this is a boolean column. If the RGP is an entire contig, it will be True, and False otherwise. If an RGP is an entire contig it can possibly be a plasmid, a region flanked with repeat sequences or a contaminant |
This is a tsv file with two columns. It links the spots of 'summarize_spots' with the RGPs of 'plastic_regions'.
column | description |
---|---|
spot_id | The spot identifier (found in the 'spot' column of 'summarize_spots') |
rgp_id | the RGP identifier (found in 'region' column of 'plastic_regions') |
The 'summarize_spots' file is a tsv file that associates each spot with multiple metrics that can indicate the dynamics of the spot.
column | description |
---|---|
spot | the spot identifier. It is unique in the pangenome |
nb_rgp | the number of RGPs present in the spot |
nb_families | The number of different gene families that are found in the spot |
nb_unique_family_sets | The number of RGPs with different gene family content. If two RGPs are identical, they will be counted only once. The difference between this number and the one provided in 'nb_rgp' can be a strong indicator of whether there is a high turnover in gene content in this area or not |
mean_nb_genes | the mean number of genes on RGPs in the spot |
stdev_nb_genes | the standard deviation of the number of genes in the spot |
max_nb_genes | the longest RGP in number of genes of the spot |
min_nb_genes | the shortest RGP in number of genes of the spot |
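To make the difference between 'nb_rgp' and 'nb_unique_family_sets' concrete, the sketch below computes both on a made-up spot. It treats the content of an RGP as a set of gene families, ignoring gene order and copy number, which is an assumption about how "identical" is defined.

```python
def spot_summary(rgps):
    """rgps: {rgp identifier: list of gene families found in that RGP}."""
    # Two RGPs with the same families collapse into a single frozenset
    family_sets = {frozenset(families) for families in rgps.values()}
    all_families = set().union(*rgps.values())
    return {
        "nb_rgp": len(rgps),
        "nb_families": len(all_families),
        "nb_unique_family_sets": len(family_sets),
    }

# Made-up spot: rgp_1 and rgp_2 carry the same families, rgp_3 differs
rgps = {
    "rgp_1": ["famA", "famB", "famC"],
    "rgp_2": ["famC", "famB", "famA"],
    "rgp_3": ["famA", "famD"],
}
summary = spot_summary(rgps)
print(summary)
```

Here 3 RGPs yield only 2 distinct family sets; a spot where both numbers are close suggests that almost every genome carries different content at that location.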
The 'fasta' command is available from 1.1.98 and on. It can be used to write fasta sequences of the pangenome or of parts of the pangenome. Most options expect a partition to write. Available partitions are:
- 'all' for the entire pangenome.
- 'persistent' for persistent families
- 'shell' for shell genes or families
- 'cloud' for cloud genes or families
- 'rgp' for genes or families found in RGPs
- 'core' for core genes or families
- 'softcore' for softcore genes or families
When using the 'softcore' filter, the '--soft_core' option can be used to modify the threshold used to determine what is part of the soft core. It is at 0.95 by default.
The --genes option can be used to write the nucleotide CDS sequences. It can be used as such, to write all of the genes of the pangenome for example:
ppanggolin fasta -p pangenome.h5 --output MY_GENES --genes all
Or to write only the persistent genes:
ppanggolin fasta -p pangenome.h5 --output MY_GENES --genes persistent
The --prot_families option can be used to write the protein sequences of the representative sequences for each family. It can be used as such for all families:
ppanggolin fasta -p pangenome.h5 --output MY_PROT --prot_families all
or for all of the shell families for example:
ppanggolin fasta -p pangenome.h5 --output MY_PROT --prot_families shell
The --gene_families option can be used to write the gene sequences of the representative sequences for each family. It can be used as such:
ppanggolin fasta -p pangenome.h5 --output MY_GENES --gene_families all
or for the cloud families for example:
ppanggolin fasta -p pangenome.h5 --output MY_GENES --gene_families cloud
The --regions option can be used to write the nucleotide sequences of the detected RGPs. It requires the fasta sequences that were used to compute the pangenome, provided in the same way as when you originally computed it. This option only has two filters:
- all, for all regions
- complete, for only the 'complete' regions which are not on a contig border
It can be used as such:
ppanggolin fasta -p pangenome.h5 --output MYREGIONS --regions all --fasta organisms.fasta.list
The 'msa' command is available from 1.1.103 and on. It calls mafft with default options to compute the MSA (multiple sequence alignment) of any partition of the pangenome. Using multiple cpus is recommended as it is quite demanding in computational resources.
By default it will write the strict 'core' (genes that are present in absolutely all genomes) and remove any duplicated genes. Beware however that, if you have many genomes (over 1000), the core will likely be either very small or even empty if you have fragmented genomes.
It will write one MSA for each family. You can then provide the directory where the MSA are written to IQ-TREE for example, to do phylogenetic analysis.
You can change the partition which is written, by using the --partition option.
ppanggolin msa -p pangenome.h5 --partition persistent
for example will compute MSA for all the persistent gene families.
Supported partitions are core, persistent, shell, cloud, softcore, accessory. If you wish to have additional filters, you can raise an issue with your request, or write a PR directly; most possibilities should be quite straightforward to add.
You can specify whether to use dna or protein sequences for the MSA by using --source. It uses protein sequences by default.
ppanggolin msa -p pangenome.h5 --source dna
It is also possible to write a single whole genome MSA file, which many phylogenetic software packages accept as input, by using the --phylo option as such:
ppanggolin msa -p pangenome.h5 --phylo
This will concatenate all of the family MSAs into a single MSA, with one sequence for each genome.
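The concatenation idea can be sketched as follows. This is an illustration rather than PPanGGOLiN's code: the function name and the sequences are made up, and padding with gaps the genomes that are missing from a family alignment is an assumption about how missing genes would be handled.

```python
def concatenate_msas(family_msas, genomes, gap="-"):
    """family_msas: list of {genome: aligned sequence}, one dict per family."""
    concatenated = {genome: [] for genome in genomes}
    for msa in family_msas:
        # All sequences of one alignment share the same length
        length = len(next(iter(msa.values())))
        for genome in genomes:
            # Assumption: a genome absent from a family alignment gets only gaps
            concatenated[genome].append(msa.get(genome, gap * length))
    return {genome: "".join(parts) for genome, parts in concatenated.items()}

# Made-up alignments for two families
family_msas = [
    {"genomeA": "ATG-", "genomeB": "ATGC"},
    {"genomeA": "TTA", "genomeB": "TTA", "genomeC": "T-A"},
]
result = concatenate_msas(family_msas, ["genomeA", "genomeB", "genomeC"])
for genome, sequence in result.items():
    print(genome, sequence)
```

Every genome ends up with a sequence of the same total length, which is what phylogenetic tools expect from a single alignment file.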
When computing a pangenome, all of the information about it is saved in the .h5 file, notably the parameters used at each step and metrics about the pangenome. You can easily retrieve this information using the 'info' module.
This command prints information on stdout, and does not write any file.
The --content option indicates the following metrics about your pangenome, if they have been computed:
- The total number of genes
- The number of genomes
- The number of gene families
- The number of edges in the pangenome graph
- The number of persistent genes, with the minimal, maximal, sd and mean presence thresholds of the families in this partition
- The number of shell genes, with the minimal, maximal, sd and mean presence thresholds of the families in this partition
- The number of cloud genes, with the minimal, maximal, sd and mean presence thresholds of the families in this partition
- The number of partitions
Additionally, if you have used the 'panrgp' workflow, or the 'rgp' and 'spot' modules, you will have the following metrics:
- The number of RGPs (Regions of Genomic Plasticity)
- The number of spots of insertion
It is used as such:
ppanggolin info -p pangenome.h5 --content
The --parameters option indicates, for each step of the analysis, the PPanGGOLiN parameters that were used and the source of the data if appropriate.
It is used as such:
ppanggolin info -p pangenome.h5 --parameters