Outputs
PPanGGOLiN provides multiple outputs to describe a pangenome. This section describes each of them.
In most cases PPanGGOLiN will provide an HDF5 file named "pangenome.h5". This file stores all the information about your pangenome and the analyses that were run. Most subcommands can take this file as input and read the information they need from it. This is practical because you can regenerate figures or output files, or rerun parts of the analysis, without redoing everything.
Each part of this section describes one possible output of PPanGGOLiN, along with the command line that generates it from the HDF5 file, which is assumed to be called 'pangenome.h5'.
When using a subcommand that can generate several files (such as 'write' or 'draw'), you can provide multiple options at once to write all of the file formats that you want in a single run.
A U-shaped plot is a figure presenting the number of families (y axis) per number of organisms (x axis). It is an .html file that can be opened with any browser. You can interact with it: zoom, move around, mouse over points to see numbers in more detail, and save the current view as a .png image file.
It can be generated using the 'draw' subcommand as follows:
ppanggolin draw -p pangenome.h5 --ucurve
A tile plot is a heatmap representing the gene families (y axis) in the organisms (x axis) making up your pangenome. A tile is colored if the gene family is present in the organism and left uncolored if it is absent. The gene families are ordered by partition, and the genomes are ordered by a hierarchical clustering based on their shared gene families (two genomes that are similar in gene family composition will be close together on the figure).
This plot is helpful for observing potential structure in your pangenome, and can also help you identify possible outliers. You can interact with it: hovering over a tile shows the gene identifier(s), the gene family and the organism that correspond to that tile.
If you build your pangenome using the 'workflow' subcommand and you have more than 500 organisms, only the 'shell' and 'persistent' partitions will be drawn. The 'cloud' is left out because the figure otherwise tends to be too heavy for a browser to open.
It can be generated using the 'draw' subcommand as follows:
ppanggolin draw -p pangenome.h5 --tile_plot
If you do not want the 'cloud' gene families to be drawn, since they represent a lot of data and can make the figure hard to open in a browser, you can use the following option:
ppanggolin draw -p pangenome.h5 --tile_plot --nocloud
The rarefaction curve is not drawn by default by the 'workflow' subcommand as it requires a lot of computation. It represents the evolution of the number of gene families in each partition as you add more genomes to the pangenome. It has been used a lot in the literature as an indicator of how much of the diversity of your taxonomic group is missing from your dataset. The idea is that if, at some point, adding genomes to your pangenome no longer adds gene families, you may have captured the entire diversity of your taxonomic group. Conversely, if each added genome still brings many new gene families, you are probably still missing a lot of them.
There are 8 partitions represented. For each partition there are multiple representations of the observed data: the observed means, medians, and 1st and 3rd quartiles of the number of gene families per number of genomes used. The fit of the data to Heaps' law is also shown; this law is usually used to represent the evolution of gene family diversity in each of the partitions.
It can be generated using the 'rarefaction' subcommand, which is dedicated to drawing this graph, as follows:
ppanggolin rarefaction -p pangenome.h5
Many options can be used with this subcommand to tune your rarefaction curves. They are all described in another part of this wiki.
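To make the Heaps' law fit mentioned above more concrete, here is a minimal sketch. It assumes the usual power-law form F = kappa * N^gamma, where N is the number of genomes and F the number of gene families, and fits it by ordinary least squares on log-transformed values. This is only an illustration: the function name `fit_heaps_law` and the data are hypothetical, and the fitting procedure used internally by PPanGGOLiN may differ.

```python
import math


def fit_heaps_law(n_genomes, n_families):
    """Fit F = kappa * N**gamma by least squares on log-transformed data."""
    xs = [math.log(n) for n in n_genomes]
    ys = [math.log(f) for f in n_families]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx                       # slope of the log-log line
    kappa = math.exp(mean_y - gamma * mean_x)  # exp of the intercept
    return kappa, gamma


# Synthetic, illustrative data generated from kappa=1000 and gamma=0.3
genomes = [1, 2, 5, 10, 20, 50]
families = [1000 * n ** 0.3 for n in genomes]

kappa, gamma = fit_heaps_law(genomes, families)
print(f"kappa={kappa:.1f} gamma={gamma:.2f}")
```

Under this form, a gamma close to 0 means the number of gene families plateaus as genomes are added, while a larger gamma means new families keep appearing.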
The organisms_statistics.tsv file is a tab-separated file describing the content of each of the genomes used for building the pangenome. It can be useful when working with fragmented data such as MAGs, or if you suspect some of your genomes to be chimeric or not to belong to your taxonomic group (such genomes will be outliers with regard to the numbers in this file). The first lines, starting with a '#', indicate the parameters used when generating the numbers describing each organism, and should be skipped when loading the file into a spreadsheet. They will be skipped automatically if you load this file with R.
This file is made of 15 columns, described in the following table:
Column | Description |
---|---|
organism | The name of the organism to which the genome belongs |
nb_families | Indicates the number of gene families present in that genome |
nb_persistent_families | The number of persistent families present in that genome |
nb_shell_families | The number of shell families present in that genome |
nb_cloud_families | The number of cloud families present in that genome |
nb_exact_core | The number of exact core families present in that genome. This number should be identical in all genomes. |
nb_soft_core | The number of soft core families present in that genome. The threshold used is indicated in the #soft_core line at the beginning of the file, and is 0.95 by default. |
nb_genes | The number of genes in that genome |
nb_persistent_genes | The number of genes whose family is persistent in that genome |
nb_shell_genes | The number of genes whose family is shell in that genome |
nb_cloud_genes | The number of genes whose family is cloud in that genome |
nb_exact_core_genes | The number of genes whose family is exact core in that genome |
nb_soft_core_genes | The number of genes whose family is soft core in that genome |
completeness | An indicator of the proportion of the single copy markers of the persistent partition that are present in the genome. While it is expected to be relatively close to 100 when working with isolates, it can be particularly interesting when working with very fragmented genomes, as it provides a de novo estimation of completeness based on the expectation that single copy markers within the persistent partition should be present in nearly all individuals of the studied taxonomic group. |
nb_single_copy_markers | The number of single copy markers present in the genome. They are computed using the duplication_margin parameter indicated at the beginning of the file. By default, they correspond to all of the persistent gene families that are not present in more than one copy in 5% (or more) of the genomes. |
It can be generated using the 'write' subcommand as follows:
ppanggolin write -p pangenome.h5 --stats
This command will also generate the 'mean_persistent_duplication.tsv' file.
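As a sketch of how organisms_statistics.tsv can be loaded outside of R, the example below skips the leading '#' lines and reads the rest as a tab-separated table using only the Python standard library. The sample content is hypothetical: it shows only a few of the 15 columns, the exact syntax of the '#' lines is an assumption, and the threshold of 90 used to flag genomes is arbitrary.

```python
import csv
import io

# Hypothetical excerpt; a real file has all 15 columns described above,
# and the exact syntax of the '#' parameter lines is an assumption.
sample = (
    "#soft_core=0.95\n"
    "#duplication_margin=0.05\n"
    "organism\tnb_families\tnb_genes\tcompleteness\n"
    "genomeA\t4100\t4300\t99.2\n"
    "genomeB\t3950\t4120\t98.7\n"
    "genomeC\t2100\t2150\t61.4\n"
)


def read_stats(handle):
    """Return one dict per genome, ignoring the '#' parameter lines."""
    rows = (line for line in handle if not line.startswith("#"))
    return list(csv.DictReader(rows, delimiter="\t"))


stats = read_stats(io.StringIO(sample))

# Flag genomes whose completeness falls below an arbitrary threshold
low = [row["organism"] for row in stats if float(row["completeness"]) < 90]
print(low)
```

With a real file, `io.StringIO(sample)` would be replaced by an open file handle on organisms_statistics.tsv.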
The pangenome graph can be written in multiple data formats, so that it can be manipulated with other software.
The graph can be written to the .gexf and _light.gexf files. The _light.gexf file contains the gene families as nodes, and the edges between gene families describing their relationships. The .gexf file contains the same thing, but also includes more information about each gene and each relation between gene families. There are two files representing the same graph because, while the non-light file is exhaustive, it can be very heavy to manipulate and most of the information in it is not of interest to everyone. The _light.gexf file should be the one you use to manipulate the pangenome graph most of the time.
They can be manipulated and visualised with Gephi, with which we have done extensive testing, or potentially with any other software or library that can read gexf files, such as networkx or gexf-js.
Using Gephi, the layout can be tuned as illustrated below:
In the _light.gexf file, the nodes contain the number of genes belonging to the gene family, the most common gene name (if you provided annotations), the most common product name (if you provided annotations), the partitions it belongs to, its average and median size in nucleotides, and the number of organisms that have this gene family.
The edges contain the number of times they are present in the pangenome.
The non-light .gexf file additionally contains all the information about the genes belonging to each gene family (their names, their product strings, their sizes) and all the information about the neighborhood relationships of each pair of genes described through the edges.
The light gexf can be generated using the 'write' subcommand as follows:
ppanggolin write -p pangenome.h5 --light_gexf
while the gexf file can be generated as follows:
ppanggolin write -p pangenome.h5 --gexf
The content of the json file corresponds to that of the .gexf file, but in json rather than gexf format. It follows the 'node-link' format as shown in this example in javascript, or as used in the networkx python library. It should be usable with both D3js and networkx, or any other software or library that supports this format.
The json can be generated using the 'write' subcommand as follows:
ppanggolin write -p pangenome.h5 --json
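The sketch below illustrates what reading a 'node-link' json looks like with the Python standard library alone. It assumes the common layout of this format, with a "nodes" list whose items carry an "id" and a "links" list whose items carry a "source" and a "target". The sample graph is hypothetical, and the real file carries many more attributes on nodes and links.

```python
import json
from collections import defaultdict

# Hypothetical, minimal node-link document (3 gene families, 2 edges)
sample = json.loads("""
{"nodes": [{"id": "famA"}, {"id": "famB"}, {"id": "famC"}],
 "links": [{"source": "famA", "target": "famB"},
           {"source": "famB", "target": "famC"}]}
""")


def degrees(graph):
    """Count, for each node, the number of links it takes part in."""
    deg = defaultdict(int)
    for link in graph["links"]:
        deg[link["source"]] += 1
        deg[link["target"]] += 1
    # Nodes without any link are reported with a degree of 0
    return {node["id"]: deg[node["id"]] for node in graph["nodes"]}


node_degrees = degrees(sample)
print(node_degrees)
```

With a real file, `json.loads(...)` would be replaced by `json.load` on an open file handle.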
The Rtab file is basically a presence/absence matrix. The columns are the genomes used to build the pangenome and the rows are the gene families. There is a 1 if the gene family is present in a genome, and a 0 otherwise. It follows exactly the same format as the 'gene_presence_absence.Rtab' file that you get from the pangenome analysis software roary.
It can be generated using the 'write' subcommand as follows:
ppanggolin write -p pangenome.h5 --Rtab
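As an illustration of how this matrix can be used, the sketch below counts, for each possible number of genomes, how many gene families are present in exactly that many genomes. These are the numbers that a U-shaped plot displays. The sample matrix is hypothetical, and the sketch assumes a tab-separated file with one header line followed by one row per gene family, with the family name in the first column.

```python
import csv
import io
from collections import Counter

# Hypothetical presence/absence matrix: 4 families, 3 genomes
sample = (
    "Gene\tgenome1\tgenome2\tgenome3\n"
    "famA\t1\t1\t1\n"
    "famB\t1\t0\t0\n"
    "famC\t1\t1\t0\n"
    "famD\t0\t0\t1\n"
)


def families_per_genome_count(handle):
    """Map 'number of genomes' to 'number of families found in that many'."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line listing the genomes
    return Counter(sum(int(value) for value in row[1:]) for row in reader)


counts = families_per_genome_count(io.StringIO(sample))
for n_genomes in sorted(counts):
    print(n_genomes, counts[n_genomes])
```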
The csv output is a .csv file following a format similar to the gene_presence_absence.csv file generated by roary. It works with scoary if you want to do pangenome-wide association studies.
It can be generated using the 'write' subcommand as follows:
ppanggolin write -p pangenome.h5 --csv
The mean_persistent_duplication.tsv file is a .tsv file with a single parameter written as a comment at the beginning of the file. This parameter indicates the proportion of genomes in which a gene family must be present more than once to be considered 'duplicated' (and therefore not a single copy marker). The file lists the gene families, their duplication ratio, their mean presence in the pangenome and whether each is considered a 'single copy marker' or not. This is particularly useful for understanding the completeness recorded in the organisms_statistics.tsv file described previously.
It can be generated using the 'write' subcommand as follows:
ppanggolin write -p pangenome.h5 --stats
This command will also generate the 'organisms_statistics.tsv' file.
The partition files are stored in the 'partitions' directory and are named after the partition that they represent (like persistent.txt for the persistent partition). Each file contains the list of identifiers of the gene families belonging to that partition, one family per line, should you need it for your pipelines or during your analysis.
You can generate those files as follows:
ppanggolin write -p pangenome.h5 --partitions
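A minimal sketch of how these files could be used in a pipeline: each .txt file of the 'partitions' directory is read, and a lookup table from gene family identifier to partition name is built from the file names. To stay self-contained, the example first writes two small hypothetical files to a temporary directory. With real data, `partitions_dir` would simply point at the directory written by PPanGGOLiN.

```python
import tempfile
from pathlib import Path


def read_partition(path):
    """Return the set of family identifiers listed in one partition file."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for the 'partitions' output directory
    partitions_dir = Path(tmp) / "partitions"
    partitions_dir.mkdir()
    (partitions_dir / "persistent.txt").write_text("famA\nfamB\n")
    (partitions_dir / "shell.txt").write_text("famC\n")

    # The file name (without extension) is the partition name
    partition_of = {}
    for path in sorted(partitions_dir.glob("*.txt")):
        for family in read_partition(path):
            partition_of[family] = path.stem

print(partition_of)
```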
The '--projection' option writes to a 'projection' directory. There will be one file in the .tsv format for every genome in the pangenome. The columns of these files are described in the following table:
Column | Description |
---|---|
gene | the unique identifier of the gene |
contig | the contig that the gene is on |
start | the start position of the gene |
stop | the stop position of the gene |
strand | The strand that the gene is on |
ori | To be completed ! |
family | the identifier of the family to which the gene belongs |
nb_copy_in_org | The number of copies of the family in the organism (if 1, the gene has no closely related paralogue in that organism) |
partition | the partition to which the gene family of the gene belongs |
persistent_neighbors | The number of neighbors classified as 'persistent' in the pangenome graph |
shell_neighbors | The number of neighbors classified as 'shell' in the pangenome graph |
cloud_neighbors | The number of neighbors classified as 'cloud' in the pangenome graph |
Those files can be generated as follows:
ppanggolin write -p pangenome.h5 --projection
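The following sketch shows one way to summarise a projection file: it counts the genes of a genome per partition and lists the genes whose family is present in more than one copy. The sample rows are hypothetical and only a subset of the columns described above is included. The partition labels used here ('persistent', 'shell', 'cloud') are assumed to match what is written in the file.

```python
import csv
import io
from collections import Counter

# Hypothetical projection content for one genome (subset of the columns)
sample = (
    "gene\tcontig\tstart\tstop\tstrand\tfamily\tnb_copy_in_org\tpartition\n"
    "gene_1\tcontig_1\t100\t1000\t+\tfamA\t1\tpersistent\n"
    "gene_2\tcontig_1\t1200\t2000\t-\tfamB\t2\tshell\n"
    "gene_3\tcontig_2\t50\t800\t+\tfamB\t2\tshell\n"
    "gene_4\tcontig_2\t900\t1500\t+\tfamC\t1\tcloud\n"
)

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))

# Number of genes of this genome in each partition
genes_per_partition = Counter(row["partition"] for row in rows)

# Genes whose family has more than one copy in this genome
multi_copy = [row["gene"] for row in rows if int(row["nb_copy_in_org"]) > 1]

print(dict(genes_per_partition))
print(multi_copy)
```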
You can write a list containing the gene family assigned to every single gene of your pangenome, in a file format exactly like the one provided by MMseqs2 through its 'createtsv' subcommand. It is basically a two-column file listing the gene family name in the first column and the gene names in the second.
You can obtain it as follows:
ppanggolin write -p pangenome.h5 --families_tsv
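Given the two-column layout described above, grouping genes by family takes only a few lines. The sketch below assumes a tab-separated file without a header line, with one gene per row. The sample content and the function name `genes_by_family` are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical content: family name in column 1, gene name in column 2
sample = (
    "famA\tgene_1\n"
    "famA\tgene_4\n"
    "famB\tgene_2\n"
)


def genes_by_family(handle):
    """Group gene names under the family they are assigned to."""
    families = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:  # ignore blank or malformed lines
            families[row[0]].append(row[1])
    return dict(families)


families = genes_by_family(io.StringIO(sample))
print(families)
```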
You can get the nucleotide sequences of all of the genes if you ran your pangenome analysis from .fasta files or from .gbff/.gbk files. You can get them as follows:
ppanggolin write -p pangenome.h5 --all_genes
You can get representative sequences for your gene families, either as nucleotide or as amino acid sequences.
If you want nucleotide sequences:
ppanggolin write -p pangenome.h5 --all_gene_families
If you want amino acid sequences:
ppanggolin write -p pangenome.h5 --all_prot_families