aMGSIM is an extension of MGSIM to create simulated ancient metagenome reads in multiple microbial synthetic communities. It integrates the methods in Gargammel and provides flexibility to create different experimental scenarios. Furthermore, aMGSIM also can track damage over the codons of the predicted proteins from the microbial genomes.
aMGSIM can produce random metagenomes (A) or it can estimate the parameters from a real sample (B) using bam-filter and metaDMG. We provide a Snakemake workflow example here to generate many random metagenomes from the results from processing real ancient metagenomes bam-filter and metaDMG.
We recommend having conda installed to manage the virtual environments
First, we create a conda virtual environment with:
wget https://raw.githubusercontent.com/genomewalker/aMGSIM/master/environment.yml
conda env create -f environment.yml
Then we proceed to install using pip:
pip install aMGSIM
By cloning in a dedicated conda environment
git clone [email protected]:genomewalker/aMGSIM.git
cd aMGSIM
conda env create -f environment.yml
conda activate aMGSIM
pip install -e .
aMGSIM incorporates three new subcommands to generate synthetic metagenomic reads:
- estimate: Estimate coverage, depth and other properties for each genome in sample processed with filterBAM
- ancient-genomes: Estimate coverage, depth and other properties for each genome in each synthetic community
- ancient-reads: Simulate ancient reads for each taxon in each synthetic community
- protein-analysis: Tracking damage to the codon positions of each simulated read.
- add-duplicates: Add duplicates to the fastq files using the models from Rochette et al. 2022
You can access the list of commands by:
$ aMGSIM --help
usage: aMGSIM [-h] [--version] {communities,estimate,multicov,ancient-genomes,ancient-reads,protein-analysis} ...
A simple tool to simulate ancient metagenomes for multiple synthetic communities
positional arguments:
{communities,estimate,multicov,ancient-genomes,ancient-reads,protein-analysis,add-duplicates}
positional arguments
communities This will call the MGSIM communities command to generate random taxon
abundances
estimate Estimate coverage, depth and other properties for each genome in a sample
processed with bam-filter
multicov Create different coverage versions of a genome compositions file
ancient-genomes Estimate coverage, depth and other properties for each genome in each
synthetic community
ancient-reads Simulate ancient reads for each taxon in each synthetic community
protein-analysis Tracking damage to the codon positions of each simulated read
add-duplicates Add duplicates to the fastq files using the models from Rochette et al. 2022
optional arguments:
-h, --help show this help message and exit
--version Print program version
We have three different approaches to generate the synthetic communities:
- We can use empirical results from filterBAM and metaDMG as shown here with the subcommand
estimate
. - We can use the subcommand
communities
from MGSIM - We can use the
multicov
subcommand to create different coverage versions of a genome compositions file generated by 1) or by the subcommandancient-genomes
.
We recommend to use the estimate subcommand to generate the synthetic communities.
To use the empirical data from filterBAM and metaDMG, you will need the following files:
- A file containing the paths to the genomes used for mapping the reads. A TSV file with the
name
andlocation of the genomes
fasta files like the file genome-paths-list.tsv - A file containing the mapping stats after filtering the BAM files with bam-filter.
- The metaDMG CSV ouput for the sample and the misincorporation file.
- The taxonomy dump files (names and nodes) used for filterBAM and metaDMG.
Then you can use the command as follows:
aMGSIM estimate f2b-config.yaml
And example of the config file can be found here [**aw-config.yaml associated with the files described above are:
# TSV file with the name of the genome and the fasta location of the references used in filterBAM, the file has no header and has the following format: name\tfasta_location
genome_paths: genome-paths-list.tsv
# Nodes and names dumps used for the filterBAM analysis
names_dmp: names-mdmg.dmp
nodes_dmp: nodes-mdmg.dmp
# Sample name
sample_name: "0e59dd2155"
# Statistics of the mapping results from the filterBAM analysis
filterBAM_stats: 0e59dd2155.filterBAM.dedup_stats-filtered.tsv.gz
# Filter options for the filetrBAM table
filterBAM_filter_conditions: { "breadth_exp_ratio": 0.5, "coverage_mean": 0.1 }
# filterBAM abundance table
filterBAM_profiling_results: 0e59dd2155.filterBAM-none.tsv.gz
# Results of the metaDMG analysis
mdmg_results: 0e59dd2155.filterBAM-mdmg.weight-0.csv.gz
# Filter conditions to identify damaged references
mdmg_filter_conditions: { "d_max": 0.1, "phi": 500, "q": 0.25 }
# Taxonomic scope of the metaDMG analysis
taxonomic_rank: "species"
# Number of non-damaged genomes
max_genomes_nondamaged: 10
# How to select the non-damaged genomes: random, most_abundant, least_abundant
max_genomes_nondamaged_selection: "random"
# Number of damaged genomes
max_genomes_damaged: 10
# How to select the non-damaged genomes: random, most_abundant, least_abundant
max_genomes_damaged_selection: "random"
# Should we reestimate the abundances after the genome selection?
use_restimated_proportions: True
# Number of cpus. [Default: 1]
cpus: 16
This will create all the files need for the next step:
- 0e59dd2155.communities.tsv: This file contains the estimated abundance table for the synthetic community. Columns are
Community
,Taxon
,Rank
andPerc_rel_abund
- 0e59dd2155.filepaths.tsv: This file contains the file paths for the reference genomes used in the filterBAM analysis. Columns are
Taxon
,TaxId
,Accession
andFasta
- 0e59dd2155.genome-compositions.tsv: This file is the one controlling how the genomes are going to be processed and will contain the empirical values. We can change them to spike certain taxa to very high or very low coverage values. Anything added to this file (all genomes by default) will avoid the random generation of coverage, sequencing depth and other vlaues. Columns are
Taxon
,Community
,Coverage
andonlyAncient
If you want to use the communities
subcommand:
aMGSIM communities --n-comm 3 examples/data/genome_list.txt example
where genome_table.tsv requires the following columns:
- Taxon: taxon name
- Fasta: genome fasta file path
The subcommand communities
will generate three synthetic communities. From the MGSIM documentation, the description of the files generated:
- example_abund.txt: taxon relative abundances for each community this is the relative number of genome copies for each taxon
- example__wAbund.txt: taxon relative abundances weighted by genome size this is the fraction of the DNA pool for each taxon
- example__beta-div.txt: beta diversity among communities see the beta-div parameter for selecting beta-diversity measures
When these files are in place we can call the second command that will generate the synthetic community. This command will use the config file ag-config.yaml
will set the basic parameters to generate the fragment needed for the simulations. Here is where you need to decide if we use paired/single end, which machine, read length, modal length of the damaged/non-damaged fraction and the libprep method. We will use the subcommand ancient-genomes
:
$ aMGSIM ancient-genomes -h
ancient-genomes: Estimate coverage, depth and other properties for each genome in each synthetic community
Usage:
ancient-genomes [options] <config>
ancient-genomes -h | --help
ancient-genomes --version
Options:
<config> Config parameters
-d --debug Debug mode (no subprocesses; verbose output)
-h --help Show this screen.
--version Show version.
Description:
Simulating ancient reads for each taxon in each synthetic community
config
------
* tab-delimited
* must contain 3 columns
* "Community" = community ID (ie., sample ID)
* "Taxon" = taxon name
* "Perc_rel_abund" = percent relative abundance of the taxon
Output
------
* A JSON file with the read properties for the selected ancient genomes
* A TSV file with the read abundances for the ancient/modern genomes in each
synthetic community
The subcommand takes as input a YAML config file with the different parameters used to create the set of genomes in each synthetic community. A description of the different parameters can be found here. Some of the essential parameters are genome_table
(the same used as in the subcommand communities
); abund_table
the abundance table created with the subcommand communities
or we can define our values creating a table like:
Community Taxon Perc_rel_abund Rank
1 Escherichia_coli_K-12_MG1655 59.052261390 1
1 Methanosarcina_barkeri_MS 32.277831367 2
1 Clostridium_perfringens_ATCC_13124 8.669907243 3
2 Escherichia_coli_K-12_MG1655 84.638859282 1
2 Methanosarcina_barkeri_MS 13.605601555 2
2 Clostridium_perfringens_ATCC_13124 1.755539163 3
3 Escherichia_coli_K-12_MG1655 67.034789353 1
3 Methanosarcina_barkeri_MS 29.298568225 2
3 Clostridium_perfringens_ATCC_13124 3.666642422 3
And the genome_comp
where we can define for each sample which genome will be ancient and the coverage. This file can be used to spike-in specific genomes at known coverages in the a sample. Here you can find an example of the file. It is a TSV file with four columns:
Taxon Community Coverage onlyAncient
Escherichia_coli_K-12_MG1655 1 1.0 True
Methanosarcina_barkeri_MS 2 0.5 False
Clostridium_perfringens_ATCC_13124 3 0 None
The column onlyAncient
can take one of the following values:
- True: The genome in this sample will only contain ancient reads with the coverage specified, 1.0 in the example
- False: The genome in this sample will contain a mixture of modern and ancient reads. The ancient coverage will be the one specified in the file, 0.5 in the example.
- None: The genome in this sample will only contain modern reads.
The values of coverage will be always downsized to fit the maximum coverage allowed by the proportion of the taxon in the sample, the number of reads and the size of the genome. If onlyAncient
is True
and the Coverage
value exceeds the maximum coverage allowed, the Coverage
will be set to the maximum coverage allowed. In the case where onlyAncient
is False
and the Coverage
value exceeds the maximum coverage allowed, the Coverage
will be set to a random value defined by the limits defined in the config file by the coverage
parameter. In case the random value is higher than the maximum coverage allowed it will take the maximum coverage allowed value.
The subcommand ancient-genomes
also creates the fragment size distribution for each genome in each sample. At the moment aMGSIM only accepts lognormal
as the distribution where to sample the fragment lengths. We can specify the mode of the distribution for modern (mode-len-modern
) and ancient (mode-len-ancient
) and aMGSIM will estimate the parameters to achieve an approximate distribution of the fragment lenghts. The estimates values will be used by fragSim
downstream.
The output of this subcommand will produce a TSV file (PREFIX_read-abundances.tsv
) with the number of reads that will be generated for each taxon in each sample; and a JSON file with the details to generate the synthetic data of each taxon in each sample. The JSON produced will be used by the subcommand ancient-reads
and has the following structure:
{
"data": [
{
"comm": 2,
"taxon": "Clostridium_perfringens_ATCC_13124",
"rel_abund": 1.7555391630000001,
"genome_size": 3256683,
"onlyAncient": true,
"fragments_ancient": {
"fragments": {
"length": [
30,
31,
32,
33,
34,
...
115,
116,
117,
120,
122
],
"freq": [
0.022599003395321643,
0.026239989701394066,
0.030409852333009715,
0.0337998101194531,
0.037248770335723824,
...
2.1455429028122703e-06,
1.0727714514061352e-06,
2.1455429028122703e-06,
1.0727714514061352e-06,
2.1455429028122703e-06
]
},
"dist_params": {
"mu": 3.740959125904414,
"sigma": 0.22820971011435498,
"rnd_seed": 27473
},
"avg_len": 43.92603991782571,
"seq_depth": 87776,
"seq_depth_original": 80713,
"fold": 2.344874217109863,
"fold_original": 2.1562061766527476
},
"fragments_modern": null,
"coverage_enforced": false,
"seq_depth": 87776
}
],
"experiment": {
"library": "pe",
"read_length": 150,
"seqSys": "HS25",
"n_reads": 5000000.0,
"date": "2021-04-12 07:49:09"
}
}
In case we would like to generate the necessary files to have defined coverage values for each genome in the compositions file we can use the subcommand multicov
:
$ aMGSIM multicov --help
usage: aMGSIM multicov [-h] [--debug] [--suffix STR] [--coverages COVERAGES] genome_compositions
optional arguments:
-h, --help show this help message and exit
--debug Print debug messages (default: False)
required arguments:
genome_compositions A TSV file containing the genome compositions
optional arguments:
--suffix STR Suffix added to the filename for the multi coverage file (default: multicov)
--coverages COVERAGES
Coverage values for the multi coverage (default: [0.1, 0.5, 1, 5, 10, 20, 50, 100])
And to generate teh followin coverages (0.1,1,10,100) for each genome in the compositions file, we could do:
aMGSIM multicov --coverage 0.1,1,10,100 0e59dd2155.genome-compositions.tsv
The command will generate all the files that can be used by ancient-reads as well.
The subcommand ancient-reads
is the one creating all the fasta and fastq files with the reads for the genomes in each community:
ancient-reads: Simulate ancient reads for each taxon in each synthetic
community
Usage:
ancient-reads [options] <genome_table> <config>
ancient-reads -h | --help
ancient-reads --version
Options:
<genome_table> Taxon genome info.
<config> Config parameters
-h --help Show this screen.
-d --debug Debug mode (no subprocesses; verbose output)
--version Show version.
Description:
Simulating ancient reads for each taxon in each synthetic community
genome_table
------------
* tab-delimited
* must contain 2 columns
* "Taxon" = taxon name
* "Fasta" = genome fasta file path
* other columns are allowed
config
------
* YAML with config parameters
(Check https://github.com/genomewalker/aMGSIM for details)
Output
------
* A set of read files for each sample (fragSim, deamSim, ART)
- read sequences are named by the taxon they originate from
- directory structure: OUTPUT_DIR/COMMUNITY/ancient-read_files
* A JSON file with the location of each file type (fragSim, deamSim, ART)
It takes as input the file with the genome information used by the subcommand communities
and a config file that specifies the details related to the tools in gargammel. A description of the different parameters can be found here. The config file has five main categories:
- global: Here one can configure general parameters like the location of the executables, the number of cpus to use or the library preparation
- fragSim: Here one can specify the parameters for fragSim and where one can use the JSON produced by the
ancient-genomes
subcommand - deamSim: One can define the propertied for the damage, it can be a
misincorporation
file from mapDamage or the parameters for the Briggs model - adptSim: Where to set the adapters to be used
- art: One can specify custom error models in case they are not included in ART
The ouput of ancient-reads
are the read files for each community after each step (fragSim -> deamSim -> adptSim -> ART) and a JSON file with the location of each file. This JSON file can be used for the subcommand in aMGSIM, protein-analysis
.
This subcommand will predict the genes in the genome with Prodigal and will match the reads and damages to the codon positions in the predicted genes. Depending the number of reads simulated it can take some time to produce all the results.
protein-analysis: Tracking damage to the codon positions of each simulated read
usage: aMGSIM protein-analysis [-h] [--debug] [--threads THREADS] [--processes PROCESSES] [--output STR]
[--tmp STR] [--gene-min-length GENE_MIN_LENGTH]
abund_read_file
optional arguments:
-h, --help show this help message and exit
--debug Print debug messages (default: False)
required arguments:
abund_read_file A JSON file containing the abundance of reads
optional arguments:
--threads THREADS Number of threads to use in each process (default: 1)
--processes PROCESSES
Number of processes to use (default: 1)
--output STR Output folder (default: protein-analysis)
--tmp STR Temporary folder (default: .tmp)
--gene-min-length GENE_MIN_LENGTH
Minimum gene length predicts by Prodigal (default: 30)
One can run the subcommand like this:
aMGSIM protein-analysis --threads 3 --processes 16 example_abund_read-files.json
Here, protein-analysis
will start 3 jobs and use 16 cores per each job, and will produce a rich TSV with the following columns:
- index: 14
- Read numnber
- chromosome: Methanosarcina_barkeri_MS__seq-0
- Genome/contig where the read and gene are originated
- End: 2114466
- End position of the read/gene intersection
- End_gene: 2114583
- End position of the gene in the genome/contig
- End_read: 2114466
- End position of the read in the genome/contig
- read_name: 1___Methanosarcina_barkeri_MS__seq-0---1027401:ancient:+:2114436:2114466:30:4
- Name of the read
- Overlap: 30
- Overlap between the gene and the read
- Start: 2114436
- Start position of the read/gene intersection
- Start_gene: 2114416
- Start position of the gene in the genome/contig
- Start_read: 2114436
- Start position of the read in the genome/contig
- Strand_gene: +
- Strand of the predicted gene
- Strand_read: +
- Strand of the simulated read
- damage: 4
- Damage position in the read
- damage_aaseq_diffs: 1:L>F
- Amino acid differences owing to the damage (Original>Damaged)
- damage_codon_diffs: 3:CTT>TTT:L>F
- Codon and amino acid differences owing to the damage (Original>Damaged)
- damage_inframe_ag:
- Damage position in the open reading frame for A-to-G (0-index)
- damage_inframe_ct: 3
- Damage position in the open reading frame for C-to-T (0-index)
- damage_intersection: 4
- Damage position in the intersection between read/gene in frame (1-index)
- damaged_seq: AACTTTTACAAATGTGAGATAAAAAATAGT
- Damage sequence in present in read
- damaged_seq_inframe_aa: NFYKCEIKN
- Damaged amino acid sequence (in frame)
- damaged_seq_inframe_nt: AACTTTTACAAATGTGAGATAAAAAAT
- Damaged nucleotide sequence (in frame)
- damaged_seq_length: 30
- Length of the damaged sequence
- gene_name: Methanosarcina_barkeri_MS__seq-0_1774
- Name of the predicted gene
- intersect_length: 30
- Length of the read/gene intersection
- intersect_seq: AACCTTTACAAATGTGAGATAAAAAATAGT
- Sequence of the intersection
- intersect_seq_inframe_aa: NLYKCEIKN
- Amino acid sequence of the intersection without damage
- intersect_seq_inframe_nt: AACCTTTACAAATGTGAGATAAAAAAT
- Nucleotide sequence of the intersection without damage
- nondamaged_seq: AACCTTTACAAATGTGAGATAAAAAATAGT
- Nucleotide non-damaged sequence
- read_length: 30
- Read length
- sequence: ATGCCTGAAGTAGAAGTCAAGAACCTTTACAAATGTGAGATAAAAAATAGTGAGATAAAAAATAGTGAGATAAAAAATAGTGAGATAAAAAATAGTGAGATAAAAAATAGTGAGATAAAAAATAGTGAGATAAAAAATAGTGAGATAAAAAATAGGTCTAAAAATTGA
- Gene nucletide sequence
- sequence_aa: MPEVEVKNLYKCEIKNSEIKNSEIKNSEIKNSEIKNSEIKNSEIKNSEIKNRSKN*
- Gene amino acid sequence
- type: ancient
- Type of read
- community: 1
- Community where it belongs
This subcommand will add duplicates to the simulated reads. The number of duplicates is based on the models described in Rochelle et al. 2022. It takes as an input one of the fastQ files produced by the ancient-reads
subcommand (or any fastQ) and a TSV table giving the number of duplication clones of each size that were observed. For generating the input check the decoratio package here. The subcommand uses the modules from decoratio to generate the duplicates. The output is a fastQ file with the duplicates added.
usage: aMGSIM add-duplicates [-h] [--debug] -f FASTQ [-c CLONE_SIZE_FREQS] [--quiet] [-t THREADS]
[-o TSV_OUT] [-q FASTQ_OUT] [--model MODEL]
optional arguments:
-h, --help show this help message and exit
--debug Print debug messages (default: False)
required arguments:
-f FASTQ, --fastq FASTQ
First input file in FASTQ format (default: None)
optional arguments:
-c CLONE_SIZE_FREQS, --clone-size-freqs CLONE_SIZE_FREQS
File name for clone size frequencies. Headers are 'clone_size n_clones'
(default: None) (default: None)
--quiet Suppress progress bar (default: False)
-t THREADS, --threads THREADS
Number of threads to use for parallel processing (default: 1)
-o TSV_OUT, --tsv-out TSV_OUT
Output file name (default: output.tsv.gz) (default: output.tsv.gz)
-q FASTQ_OUT, --fastq-out FASTQ_OUT
Output file name (default: output.fq.gz) (default: output.tsv.gz)
Decoratio arguments:
--model MODEL An amplification model specifier such as "logskewnormal" (default),
"lognormal", or "inheff:12". See MODEL SPECIFICATION section. (default:
logskewnormal)
One can run the subcommand like this:
aMGSIM add-duplicates --threads 3 --fastq example_ancient_reads.fastq.gz --clone-size-freqs example_duplication_table.tsv
See LICENSE
Please cite the original MGSIM