Outputs

Adelme Bazin edited this page Sep 23, 2020 · 38 revisions

PPanGGOLiN provides multiple outputs to describe a pangenome. In this section the different outputs will be described.

In most cases, PPanGGOLiN will provide an HDF5 file named "pangenome.h5". This file stores all the information about your pangenome and the analyses that were run. If you give it to ppanggolin through most of the subcommands, ppanggolin will read information from it. This is practical, as you can regenerate figures or output files, or rerun parts of the analysis, without redoing everything.

Each part of this section describes one possible output of PPanGGOLiN, together with the command line that generates it from the HDF5 file, which is assumed to be called 'pangenome.h5'.

When using a subcommand that can generate multiple files (like 'write' or 'draw'), you can provide multiple options at once to write all of the file formats that you want.

Figures

U-shaped plot

A U-shaped plot is a figure presenting the number of families (y axis) per number of organisms (x axis). It is a .html file that can be opened with any browser and with which you can interact, zoom, move around, mouseover to see numbers in more detail, and you can save what you are seeing as a .png image file.

It can be generated using the 'draw' subcommand as such :

ppanggolin draw -p pangenome.h5 --ucurve

Tile plot

A tile plot is a heatmap representing the gene families (y axis) in the organisms (x axis) making up your pangenome. The tiles on the graph will be colored if the gene family is present in an organism and uncolored if absent. The gene families are ordered by partition, and the genomes are ordered by a hierarchical clustering based on their shared gene families (basically two genomes that are close together in terms of gene family composition will be close together on the figure).

This plot is quite helpful to observe potential structures in your pangenome, and can also help you to identify possible outliers. You can interact with it: mousing over a tile in the plot will show the gene identifier(s), the gene family and the organism that correspond to the tile.

If you build your pangenome using the 'workflow' subcommand and you have more than 500 organisms, only the 'shell' and the 'persistent' partitions will be drawn, leaving out the 'cloud' as the figure tends to be too heavy for a browser to open it otherwise.

It can be generated using the 'draw' subcommand as such :

ppanggolin draw -p pangenome.h5 --tile_plot

If you do not want the 'cloud' gene families, as they represent a lot of data and can make the figure hard to open with a browser, you can use the following option :

ppanggolin draw -p pangenome.h5 --tile_plot --nocloud

Rarefaction curve

This figure is not drawn by default in the 'workflow' subcommand as it requires a lot of computation. It represents how the number of gene families in each partition evolves as you add more genomes to the pangenome. It has been used a lot in the literature as an indicator of how much of the diversity of your taxonomic group is missing from your dataset. The idea is that if, at some point, adding genomes to your pangenome no longer adds gene families, you may have captured the entire diversity of your taxonomic group. On the contrary, if you are still adding a lot of gene families, you may still be missing a lot of them.

There are 8 partitions represented. For each partition there are multiple representations of the observed data: the observed means, medians, and 1st and 3rd quartiles of the number of gene families per number of genomes used. You can also find the fit of the data to Heaps' law, which is usually used to represent this evolution of the diversity in terms of gene families in each of the partitions.

It can be generated using the 'rarefaction' subcommand, which is dedicated to drawing this graph, as such :

ppanggolin rarefaction -p pangenome.h5

A lot of options can be used with this subcommand to tune your rarefaction curves. They are all described in another part of this wiki.
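As a rough illustration of the Heaps' law fit mentioned above, here is a minimal sketch. It assumes the form commonly used in the pangenome literature, F(N) = kappa * N^gamma, where N is the number of genomes and F the number of gene families. The fitting method (least squares on log-transformed values) is a simplification chosen for this example and is not necessarily how PPanGGOLiN fits the curve; the function names and data points are made up.

```python
import math


def heaps_law(n, kappa, gamma):
    """Heaps' law: expected number of gene families after sampling n genomes."""
    return kappa * n ** gamma


def fit_heaps_law(genome_counts, family_counts):
    """Estimate (kappa, gamma) by ordinary least squares on log-log values.

    Taking logs gives log(F) = log(kappa) + gamma * log(N), a straight line
    whose slope is gamma and whose intercept is log(kappa).
    """
    xs = [math.log(n) for n in genome_counts]
    ys = [math.log(f) for f in family_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx
    kappa = math.exp(mean_y - gamma * mean_x)
    return kappa, gamma


# Made-up rarefaction points generated from known parameters,
# so that the fit can be checked against them.
genomes = [1, 2, 5, 10, 20, 50]
families = [heaps_law(n, 3000, 0.25) for n in genomes]

kappa, gamma = fit_heaps_law(genomes, families)
print(f"kappa={kappa:.1f}, gamma={gamma:.3f}")
```

A gamma close to 0 means that the curve flattens as genomes are added, while a larger gamma means that new gene families keep appearing.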

Files

Organisms statistics

The organisms_statistics.tsv file is a tab-separated file describing the content of each of the genomes used for building the pangenome. It might be useful when working with fragmented data such as MAGs, or if you suspect some of your genomes to be chimeric or to not belong to your taxonomic group (as those genomes will be outliers with regard to the numbers in this file). The first lines, starting with a '#', indicate the parameters used when generating the numbers describing each organism, and should not be read if loading this file into a spreadsheet. They will be skipped automatically if you load this file with R.

This file is made of 15 columns, described in the following table :

| Column | Description |
| --- | --- |
| organism | The name of the organism to which the provided genome belongs |
| nb_families | The number of gene families present in that genome |
| nb_persistent_families | The number of persistent families present in that genome |
| nb_shell_families | The number of shell families present in that genome |
| nb_cloud_families | The number of cloud families present in that genome |
| nb_exact_core | The number of exact core families present in that genome. This number should be identical in all genomes. |
| nb_soft_core | The number of soft core families present in that genome. The threshold used is indicated in the #soft_core line at the beginning of the file, and is 0.95 by default. |
| nb_genes | The number of genes in that genome |
| nb_persistent_genes | The number of genes whose family is persistent in that genome |
| nb_shell_genes | The number of genes whose family is shell in that genome |
| nb_cloud_genes | The number of genes whose family is cloud in that genome |
| nb_exact_core_genes | The number of genes whose family is exact core in that genome |
| nb_soft_core_genes | The number of genes whose family is soft core in that genome |
| completeness | An indicator of the proportion of single copy markers in the persistent that are present in the genome. While it is expected to be relatively close to 100 when working with isolates, it may be particularly interesting when working with very fragmented genomes, as it provides a de novo estimation of the completeness based on the expectation that single copy markers within the persistent should be mostly present in all individuals of the studied taxonomic group. |
| nb_single_copy_markers | The number of single copy markers present in the genome. They are computed using the parameter duplication_margin indicated at the beginning of the file. By default, they correspond to all of the persistent gene families that are not present in more than one copy in 5% (or more) of the genomes. |

It can be generated using the 'write' subcommand as such :

ppanggolin write -p pangenome.h5 --stats

This command will also generate the 'mean_persistent_duplication.tsv' file.
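If you load this file outside of R, the '#' lines have to be skipped by hand. The sketch below shows one way to do so using only the Python standard library. The exact content of the '#' lines and all values in the sample are made up for the example, and the threshold of 90 is arbitrary.

```python
import csv
import io


def read_commented_tsv(handle):
    """Split a TSV into its leading '#' parameter lines and its data rows."""
    params, data_lines = [], []
    for line in handle:
        if line.startswith("#"):
            params.append(line.rstrip("\n"))
        elif line.strip():
            data_lines.append(line)
    # The first non-comment line is the header with the column names.
    rows = list(csv.DictReader(data_lines, delimiter="\t"))
    return params, rows


# Made-up excerpt with only a few of the 15 columns.
sample = (
    "#soft_core=0.95\n"
    "#duplication_margin=0.05\n"
    "organism\tnb_families\tnb_genes\tcompleteness\n"
    "genomeA\t4100\t4300\t99.2\n"
    "genomeB\t2500\t2600\t61.5\n"
)

params, rows = read_commented_tsv(io.StringIO(sample))

# Flag genomes whose completeness is suspiciously low.
low = [row["organism"] for row in rows if float(row["completeness"]) < 90]
print(low)  # ['genomeB']
```

With a real file, replace the `io.StringIO(sample)` argument with a handle obtained from `open("organisms_statistics.tsv")`.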

pangenomeGraph files

The pangenome's graph can be written in multiple data formats, in order to manipulate it with other software.

gexf and light gexf

The graph can be written to the .gexf and the _light.gexf files. The _light.gexf file contains the gene families as nodes and the edges between gene families describing their relationship. The .gexf file contains the same thing, but also includes more information about each gene and each relation between gene families. We have made two different files representing the same graph because, while the non-light file is exhaustive, it can be very heavy to manipulate and most of the information in it is not of interest to everyone. The _light.gexf file should be the one you use to manipulate the pangenome graph most of the time.

They can be manipulated and visualised through a software called Gephi, with which we have done extensive testing, or potentially with any other software or library that can read gexf files, such as networkx or gexf-js among others.

Using Gephi, the layout can be tuned as illustrated below:

Gephi layout

We advise using the Gephi "Force Atlas 2" algorithm to compute the graph layout, with "Stronger Gravity: on" and "scaling: 4000", but don't hesitate to tinker with the layout parameters.

In the _light.gexf file, the nodes contain the number of genes belonging to the gene family, the most common gene name (if you provided annotations), the most common product name (if you provided annotations), the partition it belongs to, its average and median size in nucleotides, and the number of organisms that have this gene family.

The edges contain the number of times they are present in the pangenome.

In addition to this, the non-light .gexf file contains all the information about the genes belonging to each gene family (their names, their product strings and their sizes) and all the information about the neighborhood relationships of each pair of genes described through the edges.

The light gexf can be generated using the 'write' subcommand as such :

ppanggolin write -p pangenome.h5 --light_gexf

while the gexf file can be generated as such :

ppanggolin write -p pangenome.h5 --gexf
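Since the non-light file can be heavy, it can be handy to check the size of a graph before opening it in Gephi. The sketch below counts nodes and edges by streaming the file with the Python standard library. It assumes only that nodes and edges are stored as `node` and `edge` XML elements, as in the GEXF format; the sample document is made up and far smaller than a real pangenome graph.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def gexf_summary(source):
    """Count <node> and <edge> elements while streaming through the file."""
    nodes = edges = 0
    for _, element in ET.iterparse(source, events=("end",)):
        name = local_name(element.tag)
        if name == "node":
            nodes += 1
        elif name == "edge":
            edges += 1
        element.clear()  # free memory used by elements already counted
    return nodes, edges


# Made-up miniature graph with three gene families and two edges.
sample = b"""<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
  <graph defaultedgetype="undirected">
    <nodes>
      <node id="famA" label="famA"/>
      <node id="famB" label="famB"/>
      <node id="famC" label="famC"/>
    </nodes>
    <edges>
      <edge id="0" source="famA" target="famB"/>
      <edge id="1" source="famB" target="famC"/>
    </edges>
  </graph>
</gexf>"""

print(gexf_summary(io.BytesIO(sample)))  # (3, 2)
```

With a real file, pass the path of the .gexf or _light.gexf file written by the commands above instead of the `io.BytesIO` object.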

json

The json file's content corresponds to the .gexf file's content, but in json rather than gexf format. It follows the 'node-link' format as shown in this example in javascript, or as used in the networkx python library, and it should be usable with both D3js and networkx, or any other software or library that supports this format.

The json can be generated using the 'write' subcommand as such :

ppanggolin write -p pangenome.h5 --json
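The following sketch reads a node-link document with the Python standard library and computes the degree of each node. It assumes the usual node-link layout, with a "nodes" list whose items carry an "id" and a "links" list whose items carry a "source" and a "target"; check these key names against your own file, as they are not confirmed on this page. The sample graph is made up.

```python
import json
from collections import Counter


def summarize_node_link(graph):
    """Return (number of nodes, number of links, degree of each node)."""
    nodes = graph.get("nodes", [])
    links = graph.get("links", [])
    degree = Counter()
    for link in links:
        degree[link["source"]] += 1
        degree[link["target"]] += 1
    return len(nodes), len(links), degree


# Made-up miniature graph in node-link form.
sample = """
{
  "nodes": [{"id": "famA"}, {"id": "famB"}, {"id": "famC"}],
  "links": [
    {"source": "famA", "target": "famB"},
    {"source": "famB", "target": "famC"}
  ]
}
"""

nb_nodes, nb_links, degree = summarize_node_link(json.loads(sample))
print(nb_nodes, nb_links, degree["famB"])  # 3 2 2
```

With a real file, use `json.load` on an open file handle instead of `json.loads` on a string.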

gene presence absence

This file is basically a presence absence matrix. The columns are the genomes used to build the pangenome, and the rows are the gene families. The identifier of the gene family is the gene identifier chosen as a representative. There is a 1 if the gene family is present in a genome, and a 0 otherwise. It follows the exact same format as the 'gene_presence_absence.Rtab' file that you get from the pangenome analysis software roary.

It can be generated using the 'write' subcommand as such :

ppanggolin write -p pangenome.h5 --Rtab
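As an illustration of how this matrix can be used, the sketch below parses it with the Python standard library and counts how many gene families are found in exactly k genomes, which are the numbers behind the U-shaped plot described earlier. It assumes a header line whose first cell is a label for the family column, followed by one genome name per column; the family and genome names are made up.

```python
import csv
import io
from collections import Counter


def read_rtab(handle):
    """Return (genome names, {family: list of 0/1 values, one per genome})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    genomes = header[1:]
    matrix = {row[0]: [int(value) for value in row[1:]] for row in reader if row}
    return genomes, matrix


def u_curve(matrix):
    """Number of gene families present in exactly k genomes, for each k."""
    return Counter(sum(presence) for presence in matrix.values())


# Made-up matrix: four families across three genomes.
sample = (
    "Gene\tg1\tg2\tg3\n"
    "fam1\t1\t1\t1\n"
    "fam2\t1\t0\t0\n"
    "fam3\t1\t1\t1\n"
    "fam4\t0\t1\t0\n"
)

genomes, matrix = read_rtab(io.StringIO(sample))
print(sorted(u_curve(matrix).items()))  # [(1, 2), (3, 2)]
```

Here two families are found in a single genome and two are found in all three, which gives the two ends of the "U".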

matrix

This file is a .csv file following a format similar to the gene_presence_absence.csv file generated by roary, and it works with scoary if you want to do pangenome-wide association studies.

It can be generated using the 'write' subcommand as such :

ppanggolin write -p pangenome.h5 --csv

mean persistent duplication

This file is a .tsv file, with a single parameter written as a comment at the beginning of the file, which indicates the proportion of genomes in which a gene family must be present more than once to be considered 'duplicated' (and not a single copy marker). This file lists the gene families, their duplication ratio, their mean presence in the pangenome and whether each of them is considered a 'single copy marker' or not, which is particularly useful for understanding how the completeness recorded in the organisms statistics file described previously is calculated.

It can be generated using the 'write' subcommand as such :

ppanggolin write -p pangenome.h5 --stats

This command will also generate the 'organisms_statistics.tsv' file.
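To make the link between the duplication ratio, the single copy markers and the completeness more concrete, here is a minimal sketch of the reasoning. It is an illustration and not PPanGGOLiN's code. In particular, it assumes that the duplication ratio of a family is computed over the genomes that carry the family, and that the completeness of a genome is the percentage of single copy markers found in it; the function names and copy numbers are made up.

```python
def single_copy_markers(copies, duplication_margin=0.05):
    """Select the families that are rarely found in more than one copy.

    copies maps each persistent family to {genome: copy number}.
    """
    markers = set()
    for family, per_genome in copies.items():
        carriers = [n for n in per_genome.values() if n > 0]
        if not carriers:
            continue
        duplicated = sum(1 for n in carriers if n > 1)
        # Assumption: the ratio is taken over genomes carrying the family.
        if duplicated / len(carriers) < duplication_margin:
            markers.add(family)
    return markers


def completeness(genome, copies, markers):
    """Percentage of single copy markers that are present in the genome."""
    present = sum(1 for family in markers if copies[family].get(genome, 0) > 0)
    return round(100 * present / len(markers), 2)


# Made-up copy numbers for three persistent families in three genomes.
copies = {
    "famA": {"g1": 1, "g2": 1, "g3": 1},
    "famB": {"g1": 2, "g2": 1, "g3": 1},  # duplicated in 1 genome out of 3
    "famC": {"g1": 1, "g2": 1, "g3": 0},  # missing from g3
}

markers = single_copy_markers(copies)
print(sorted(markers))                      # ['famA', 'famC']
print(completeness("g3", copies, markers))  # 50.0
```

In this toy example famB is excluded from the markers because it is duplicated in a third of the genomes, and g3 reaches only 50 because it lacks one of the two remaining markers.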

partitions

Those files will be stored in the 'partitions' directory and will be named after the partition that they represent (like persistent.txt for the persistent partition). Each of those files contains the list of gene family identifiers belonging to that partition, one family per line, should you need it for your pipelines or during your analysis.

You can generate those files as such :

ppanggolin write -p pangenome.h5 --partitions
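If you need the reverse lookup (from a gene family to its partition) in a pipeline, the sketch below builds it from such a directory using the Python standard library. It assumes only what is described above: one .txt file per partition with one family identifier per line. A set is kept per family in case the directory contains files whose contents overlap. The demonstration writes two made-up files to a temporary directory.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def load_partitions(directory):
    """Map each gene family identifier to the names of the files listing it."""
    family_to_partitions = defaultdict(set)
    for path in sorted(Path(directory).glob("*.txt")):
        for line in path.read_text().splitlines():
            family = line.strip()
            if family:
                family_to_partitions[family].add(path.stem)
    return dict(family_to_partitions)


# Demonstration on a temporary directory holding two made-up partition files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "persistent.txt").write_text("famA\nfamB\n")
    Path(tmp, "cloud.txt").write_text("famC\n")
    mapping = load_partitions(tmp)

print(mapping["famC"])  # {'cloud'}
```

With real data, call `load_partitions` on the 'partitions' directory written by the command above.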

projection

This option writes in a 'projection' directory. There will be a file written in the .tsv file format for every single genome in the pangenome. The columns of this file are described in the following table :

| Column | Description |
| --- | --- |
| gene | The unique identifier of the gene |
| contig | The contig that the gene is on |
| start | The start position of the gene |
| stop | The stop position of the gene |
| strand | The strand that the gene is on |
| ori | Will be T if the gene name is dnaA |
| family | The identifier of the family to which the gene belongs |
| nb_copy_in_org | The number of copies of the family in the organism (basically, if 1, the gene has no closely related paralogue in that organism) |
| partition | The partition to which the gene family of the gene belongs |
| persistent_neighbors | The number of neighbors classified as 'persistent' in the pangenome graph |
| shell_neighbors | The number of neighbors classified as 'shell' in the pangenome graph |
| cloud_neighbors | The number of neighbors classified as 'cloud' in the pangenome graph |

Those files can be generated as such :

ppanggolin write -p pangenome.h5 --projection
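As an example of what can be done with one of these files, the sketch below counts the genes of a genome per partition and lists the genes whose family is present in more than one copy. It relies on the column names of the table above. All values in the sample, including the way the strand, the 'ori' column and the partitions are written, are made up.

```python
import csv
import io
from collections import Counter


def summarize_projection(handle):
    """Return (genes per partition, genes whose family has several copies)."""
    counts = Counter()
    multi_copy = []
    for row in csv.DictReader(handle, delimiter="\t"):
        counts[row["partition"]] += 1
        if int(row["nb_copy_in_org"]) > 1:
            multi_copy.append(row["gene"])
    return counts, multi_copy


columns = [
    "gene", "contig", "start", "stop", "strand", "ori", "family",
    "nb_copy_in_org", "partition", "persistent_neighbors",
    "shell_neighbors", "cloud_neighbors",
]
# Made-up records for a single contig.
records = [
    ["g1", "c1", "1", "900", "+", "T", "famA", "1", "persistent", "2", "0", "0"],
    ["g2", "c1", "950", "1800", "-", "F", "famB", "2", "shell", "1", "1", "0"],
    ["g3", "c1", "1900", "2500", "+", "F", "famB", "2", "shell", "1", "0", "1"],
    ["g4", "c1", "2600", "3000", "+", "F", "famC", "1", "cloud", "0", "2", "0"],
]
sample = "\n".join("\t".join(fields) for fields in [columns] + records) + "\n"

counts, multi_copy = summarize_projection(io.StringIO(sample))
print(counts["shell"], multi_copy)  # 2 ['g2', 'g3']
```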

Gene families and genes

You can write a list containing the gene family assigned to every single gene of your pangenome, in a file format exactly like the one provided by MMseqs2 through its subcommand 'createtsv'. It is basically a three-column file listing the gene family name in the first column and the gene name in the second. The third column is either empty or contains an "F", which indicates that the gene is potentially a gene fragment and not complete. This is indicated only if the defragmentation pipeline is used.

You can obtain it as such :

ppanggolin write -p pangenome.h5 --families_tsv
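The three-column layout described above can be read in a few lines. The sketch below groups the genes by family and collects the genes flagged with an "F". It assumes that the file is tab-separated and has no header line; the identifiers are made up.

```python
import io
from collections import defaultdict


def read_families(handle):
    """Return ({family: [genes]}, set of genes flagged as fragments)."""
    families = defaultdict(list)
    fragments = set()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        family, gene = fields[0], fields[1]
        families[family].append(gene)
        if len(fields) > 2 and fields[2] == "F":
            fragments.add(gene)
    return dict(families), fragments


# Made-up content: famA has two genes, one of which is a potential fragment.
sample = "famA\tgene1\t\nfamA\tgene2\tF\nfamB\tgene3\t\n"

families, fragments = read_families(io.StringIO(sample))
print(families["famA"], fragments)  # ['gene1', 'gene2'] {'gene2'}
```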

CDS nucleotide sequences

You can get the nucleotide sequences of all of the genes if you ran your pangenome analysis from .fasta files or from .gbff/.gbk files. You can get them as such :

ppanggolin write -p pangenome.h5 --all_genes

Families representative sequences

You can get representative sequences for your gene families, either as nucleotide or as amino acid sequences.

If you want nucleotide sequences :

ppanggolin write -p pangenome.h5 --all_gene_families

If you want amino acid sequences :

ppanggolin write -p pangenome.h5 --all_prot_families

Plastic regions

This file is a tsv file that lists all of the detected Regions of Genome Plasticity (RGPs). It requires that the RGP detection analysis has been run, using either the panrgp command or the rgp command.

The file has the following format :

| Column | Description |
| --- | --- |
| region | A unique identifier for the region. This is usually built from the contig it is on, with a number after it. |
| organism | The organism it is in. This is the organism name provided by the user. |
| start | The start position of the RGP in the contig |
| stop | The stop position of the RGP in the contig |
| genes | The number of genes included in the RGP |
| contigBorder | A boolean column. If the RGP is on a contig border it will be True, otherwise it will be False. An RGP on a contig border is probably not complete. |
| wholeContig | A boolean column. If the RGP is an entire contig it will be True, and False otherwise. An RGP that is an entire contig can possibly be a plasmid, a region flanked with repeat sequences or a contaminant. |
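The two boolean columns make it easy to keep only the RGPs that are likely to be complete. The sketch below does so with the Python standard library, assuming that the booleans are written as the strings True and False as described in the table; the region names and coordinates are made up.

```python
import csv
import io


def likely_complete_rgps(handle):
    """Keep the RGPs that neither touch a contig border nor span a whole contig."""
    return [
        row
        for row in csv.DictReader(handle, delimiter="\t")
        if row["contigBorder"] == "False" and row["wholeContig"] == "False"
    ]


# Made-up regions: only the first one is away from any contig border.
sample = (
    "region\torganism\tstart\tstop\tgenes\tcontigBorder\twholeContig\n"
    "contig1_RGP_0\torgA\t1200\t9800\t9\tFalse\tFalse\n"
    "contig2_RGP_0\torgA\t1\t4500\t5\tTrue\tFalse\n"
    "contig3_RGP_0\torgB\t1\t30000\t31\tTrue\tTrue\n"
)

kept = likely_complete_rgps(io.StringIO(sample))
print([row["region"] for row in kept])  # ['contig1_RGP_0']
```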

Spots

This is a tsv file with two columns. It links the spots of 'summarize_spots' with the RGPs of 'plastic_regions'.

| Column | Description |
| --- | --- |
| spot_id | The spot identifier (found in the 'spot' column of 'summarize_spots') |
| rgp_id | The RGP identifier (found in the 'region' column of 'plastic_regions') |

Summarize spots

This is a tsv file that associates each spot with multiple metrics that can indicate the dynamics of the spot.

| Column | Description |
| --- | --- |
| spot | The spot identifier. It is unique in the pangenome. |
| nb_rgp | The number of RGPs present in the spot |
| nb_families | The number of different gene families that are found in the spot |
| nb_unique_family_sets | The number of RGPs with different gene family content. If two RGPs are identical, they will be counted only once. The difference between this number and the one provided in 'nb_rgp' can be a strong indicator of whether there is a high turnover in gene content in this area or not. |
| mean_nb_genes | The mean number of genes on RGPs in the spot |
| stdev_nb_genes | The standard deviation of the number of genes on RGPs in the spot |
| max_nb_genes | The number of genes in the longest RGP of the spot |
| min_nb_genes | The number of genes in the shortest RGP of the spot |
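To show how the spot files fit together, the sketch below groups the RGP identifiers by spot and compares 'nb_unique_family_sets' with 'nb_rgp' for each spot. The ratio used here is only one possible way to express the difference mentioned above and is not a metric computed by PPanGGOLiN; the identifiers and numbers are made up, and only the columns needed for the example are included.

```python
import csv
import io
from collections import defaultdict


def rgps_by_spot(handle):
    """Group the RGP identifiers of the spots file by spot identifier."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        groups[row["spot_id"]].append(row["rgp_id"])
    return dict(groups)


def content_diversity(handle):
    """Ratio nb_unique_family_sets / nb_rgp for each spot.

    A value of 1 means that every RGP of the spot has a different gene
    family content; a value close to 0 means that most RGPs are identical.
    """
    return {
        row["spot"]: int(row["nb_unique_family_sets"]) / int(row["nb_rgp"])
        for row in csv.DictReader(handle, delimiter="\t")
    }


# Made-up content for the two files.
spots = (
    "spot_id\trgp_id\n"
    "spot_1\tcontig1_RGP_0\n"
    "spot_1\tcontig7_RGP_2\n"
    "spot_2\tcontig3_RGP_1\n"
)
summary = (
    "spot\tnb_rgp\tnb_unique_family_sets\n"
    "spot_1\t2\t1\n"
    "spot_2\t1\t1\n"
)

groups = rgps_by_spot(io.StringIO(spots))
diversity = content_diversity(io.StringIO(summary))
print(groups["spot_1"], diversity["spot_1"])  # ['contig1_RGP_0', 'contig7_RGP_2'] 0.5
```

The RGP identifiers returned by `rgps_by_spot` can then be looked up in the 'region' column of the plastic regions file.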