Skip to content

Commit

Permalink
Add CONJscan annotations to plasmid summary
Browse files Browse the repository at this point in the history
  • Loading branch information
apcamargo committed Aug 22, 2022
1 parent fe55c94 commit ec9dd3f
Show file tree
Hide file tree
Showing 3 changed files with 28 additions and 13 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Compute marker enrichment in the `marker-classification` module.
- Add columns for plasmid and virus marker enrichment to the `_plasmid_summary.tsv` and `_virus_summary.tsv` files.
- Set `--min-plasmid-marker-enrichment` and `--min-virus-marker-enrichment` to `0` as default. This will alter the results when using default parameters.
- Add support for plasmid and virus hallmarks. Requires geNomad datavase v1.1.
- Add support for plasmid and virus hallmarks. Requires geNomad database v1.1.
- Add CONJscan annotations to `_plasmid_summary.tsv`. Requires geNomad database v1.1.

### Changed
- `Sequence` class: simplify `has_dtr` return statement.
Expand Down
17 changes: 10 additions & 7 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -139,7 +139,7 @@ gene start end length strand gc_
NC_002695.2|provirus_300073_325822_264 300073 301047 975 -1 0.476 11 AGGAG/GGAGG NA NA NA 0 0 0 1 NA NA NA NA
NC_002695.2|provirus_300073_325822_265 301423 301812 390 -1 0.392 11 GGA/GAG/AGG NA NA NA 0 0 0 1 NA NA NA NA
NC_002695.2|provirus_300073_325822_266 301940 302653 714 -1 0.461 11 None GENOMAD.222303.VP 4.577e-09 59 0 0 0 2561 Caudoviricetes NA PF06223.15;K10762 Minor tail protein T
NC_002695.2|provirus_300073_325822_267 302754 302954 201 1 0.428 11 AGGAGG GENOMAD.061471.VV 1.653e-15 71 0 0 1 2561 Caudoviricetes NA PF09048.13;TIGR03339;K22302;COG1609 Cro
NC_002695.2|provirus_300073_325822_267 302754 302954 201 1 0.428 11 AGGAGG GENOMAD.061471.VV 1.653e-15 71 0 0 1 2561 Caudoviricetes NA PF09048.13;TIGR03339;K22302;COG1609 Cro
NC_002695.2|provirus_300073_325822_268 303073 303366 294 1 0.503 11 AGGA/GGAG/GAGG GENOMAD.129061.VV 5.21e-33 123 0 0 1 2561 Caudoviricetes NA PF05269.14;TIGR00721 Bacteriophage CII protein
```
Expand All @@ -162,7 +162,7 @@ The columns in this file are:
- `virus_hallmark`: Whether the marker assigned to this gene represents a virus hallmark.
- `taxid`: Taxonomic identifier of the marker assigned to this gene (you can ignore this as it is meant to be used internally by geNomad).
- `taxname`: Name of the taxon associated with the assigned geNomad marker. In this example, we can see that the annotated proteins are all characteristic of *Caudoviricetes* (which is why the provirus was assigned to this class).
- `annotation_conjscan`: If the marker that matched the gene matches a conjugation-related gene (as defined in [CONJscan](https://link.springer.com/protocol/10.1007/978-1-4939-9877-7_19)) this field will show which CONJscan acession was assigned to the marker.
- `annotation_conjscan`: If the marker that matched the gene is a conjugation-related gene (as defined in [CONJscan](https://link.springer.com/protocol/10.1007/978-1-4939-9877-7_19)) this field will show which CONJscan acession was assigned to the marker.
- `annotation_accessions`: Some of the geNomad markers are functionally annotated. This column tells you which entries in Pfam, TIGRFAM, COG, and KEGG were assigned to the marker.
- `annotation_description`: A text describing the function assigned to the marker.

Expand All @@ -174,11 +174,14 @@ The other two virus-related files within the summary directory are `GCF_00000886

Enough with viruses. What about the plasmids?

As you would expect, the data pertaining to the identification of plasmids can be found in the `<prefix>_plasmid_summary.tsv`, `<prefix>_genes.tsv`, `<prefix>_plasmid.fna`, and `<prefix>_plasmid_proteins.faa` files. These are mostly very similar to their virus counterparts. The only difference is that `<prefix>_plasmid_summary.tsv` (shown below) doesn't have the virus-specific columns that are in `<prefix>_virus_summary.tsv` (`coordinates` and `topology`).
As you would expect, the data pertaining to the identification of plasmids can be found in the `<prefix>_plasmid_summary.tsv`, `<prefix>_genes.tsv`, `<prefix>_plasmid.fna`, and `<prefix>_plasmid_proteins.faa` files. These are mostly very similar to their virus counterparts. The differences in `<prefix>_plasmid_summary.tsv` (shown below) are the following:

- Virus-specific columns that are in `<prefix>_virus_summary.tsv` (`coordinates` and `topology`) are not present.
- The `conjugation_genes` column lists genes that might involved in conjugation. It's important to note that the presence of such genes is not sufficient to tell whether a given plasmid is conjugative or mobilizible. If you are interested in identifying conjugative plasmids, we recommend you to analyze the plasmids you identified using geNomad with [CONJscan](https://link.springer.com/protocol/10.1007/978-1-4939-9877-7_19).

```
seq_name length topology n_genes genetic_code plasmid_score fdr n_hallmarks marker_enrichment
----------- ------ -------- ------- ------------ ------------- --- ----------- -----------------
NC_002128.1 92721 Linear 88 11 0.9942 NA 5 46.4458
NC_002127.1 3306 Linear 3 11 0.9913 NA 1 1.6586
seq_name length topology n_genes genetic_code plasmid_score fdr n_hallmarks marker_enrichment conjugation_genes
----------- ------ -------- ------- ------------ ------------- --- ----------- ----------------- -----------------
NC_002128.1 92721 Linear 88 11 0.9942 NA 5 46.4458 T_virB11;MOBP1
NC_002127.1 3306 Linear 3 11 0.9913 NA 1 1.6586 NA
```
21 changes: 16 additions & 5 deletions genomad/modules/summary.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
import itertools
import sys
from collections import defaultdict

import numpy as np
from genomad import sequence, utils
Expand Down Expand Up @@ -511,6 +512,7 @@ def main(
)

with console.status("Writting gene annotation data."):
conjscan_genes_dict = defaultdict(list)
with open(outputs.summary_plasmid_genes_output, "w") as fout1, open(
outputs.summary_virus_genes_output, "w"
) as fout2:
Expand All @@ -527,9 +529,12 @@ def main(
for line in utils.read_file(
outputs.annotate_genes_output, skip_header=True
):
if line.split("\t")[0].rsplit("_", 1)[0] in plasmid_name_set:
seq_name = line.split("\t")[0].rsplit("_", 1)[0]
if seq_name in plasmid_name_set:
fout1.write(line)
elif line.split("\t")[0].rsplit("_", 1)[0] in virus_name_set:
if line.split("\t")[16] != "NA":
conjscan_genes_dict[seq_name].append(line.split("\t")[16])
elif seq_name in virus_name_set:
fout2.write(line)
if include_provirus:
for line in utils.read_file(
Expand Down Expand Up @@ -575,7 +580,7 @@ def main(
with open(outputs.summary_plasmid_output, "w") as fout:
fout.write(
"seq_name\tlength\ttopology\tn_genes\tgenetic_code\tplasmid_score\t"
"fdr\tn_hallmarks\tmarker_enrichment\n"
"fdr\tn_hallmarks\tmarker_enrichment\tconjugation_genes\n"
)
for seq_name, score, fdr in itertools.zip_longest(
plasmid_name_array,
Expand All @@ -595,12 +600,18 @@ def main(
)
n_hallmarks = n_hallmarks[0]
marker_enrichment = f"{marker_enrichment[1]:.4f}"
conjugation_genes = conjscan_genes_dict.get(seq_name, [])
if len(conjugation_genes):
conjugation_genes = ";".join(conjugation_genes)
else:
conjugation_genes = "NA"
else:
marker_enrichment = "NA"
n_hallmarks = "NA"
conjugation_genes = "NA"
fout.write(
f"{seq_name}\t{length}\t{topology}\t{n_genes}\t{genetic_code}\t"
f"{score}\t{fdr}\t{n_hallmarks}\t{marker_enrichment}\n"
f"{seq_name}\t{length}\t{topology}\t{n_genes}\t{genetic_code}\t{score}\t"
f"{fdr}\t{n_hallmarks}\t{marker_enrichment}\t{conjugation_genes}\n"
)

# Write virus summary file
Expand Down

0 comments on commit ec9dd3f

Please sign in to comment.