-
Notifications
You must be signed in to change notification settings - Fork 3
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #81 from PeteHaitch/add_strand
Strand information is now recorded in output as either '+' (forward strand), '-' (reverse strand) or '*' (not applicable). Removed --strand-specific parameter and replaced with --strand-collapse, which effectively has the opposite meaning.
- Loading branch information
Showing
6 changed files
with
209 additions
and
157 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,6 +5,7 @@ | |
`comethylation` is a methylation caller for methylation events that co-occur on the same DNA fragment from high-throughput bisulfite sequencing data, such as `methylC-seq`. A typical read from such an experiment reports a binary methylated or unmethylated measurement at multiple loci, where each read originates from a single cell. `comethylation` allows us to investigate the co-occurence of methylation marks at the level of a single cell. | ||
|
||
# Installation and dependencies | ||
|
||
Simply running | ||
|
||
``` | ||
|
@@ -25,6 +26,7 @@ A 2-tuple (m = 2) is a pair of methylation loci; `comethylation` tabulates the n | |
For a chosen value of _m_, all _m-tuples_ are extracted and tabulated across the genome. The key point is that _m-tuples_ are only constructed if all _m_ loci are within a single read, so we can investigate the co-occurence of methylation events at the level of a single cell. | ||
|
||
## Basic usage | ||
|
||
`comethylation` processes a single `BAM` file and works for both single-end and paired-end sequencing data. Example `BAM` files from single-end directional and paired-end directional bisulfite-sequencing experiments are available in the `data/` directory. | ||
|
||
Methylation measurements may be filtered by base quality or other criteria such as the mapping quality of the read or whether the read is marked as a PCR duplicate. For a full list of filtering options, please run `comethylation --help` or see the __Advanced Usage__ section below. | ||
|
@@ -34,18 +36,21 @@ Currently, the BAM file must have been created with [Bismark](http://www.bioinfo | |
The main options to pass `comethylation` are the size of the m-tuple (`-m`); the type of methylation, which is some combination of _CG_, _CHG_, _CHH_ and _CNN_ (`--methylation-type`); any filters to be applied to reads or positions within reads (see below); the BAM file; and the sample name, which will be used as a prefix for all output files. Multiple methylation types may be specified jointly, e.g., `--methylation-type CG --methylation-type CHG` | ||
|
||
## Output | ||
|
||
Three output files are created and summary information is written to `STDOUT`. The main output file is a tab-delimited file of all m-tuples (`<in>.<--methylation-type>.<-m>.tsv`), where `<in>` is the prefix of the `<in.bam>` BAM file. | ||
|
||
Here are the 5 rows (including with the header row) from `data/se_directional.fq.gz_bismark_bt2.CG.2.tsv`, which is created by running the single-end directional example shown below: | ||
Here are the first 5 rows (including with the header row) from `data/se_directional.fq.gz_bismark_bt2.CG.2.tsv`, which is created by running the single-end directional example shown below: | ||
|
||
``` | ||
chr pos1 pos2 MM MU UM UU | ||
chr1 3079983 3079993 0 0 1 0 | ||
chr1 6387768 6387783 1 0 0 0 | ||
chr1 6790760 6790796 0 0 0 1 | ||
chr1 6790796 6790841 0 0 1 0 | ||
chr strand pos1 pos2 MM MU UM UU | ||
chr1 + 6387768 6387783 1 0 0 0 | ||
chr1 + 7104116 7104139 1 0 0 0 | ||
chr1 + 7104139 7104152 1 0 0 0 | ||
chr1 + 9256170 9256179 0 0 0 1 | ||
``` | ||
So, for example, at the CpG 2-tuple chr1:(3,079,983, 3,079,993) we observed 1 read that was unmethylated at chr1:3,079,983 and methylated at chr1:3,079,993. | ||
So, for example, at the CpG 2-tuple chr1:+:(6,387,768, 6,387,783) we observed 1 read that was methylated at chr1:+:6,387,768 and methylated at chr1:+:6,387,783. | ||
|
||
The `strand` is recorded as `+` (forward strand, "OT" in Bismark), `-` (reverse strand, "OB" in Bismark) or `*`, meaning not applicable, if the `--strand-collapse` option is set. The position of all methylation loci is always with respect to the forward strand. | ||
|
||
The second file (`<in>.<--methylation-type>_per_read.hist`) is a text histogram of the number of methylation loci per read or readpair (of the type specified by `--methylation-type`) that passed the filters specified at runtime of `comethylation`. | ||
|
||
|
@@ -83,12 +88,14 @@ SRR921747.81_JONAS:2105:C0G2AACXX:4:2205:18836:56196_length=101 read has indel | |
So, both of these reads were filtered out because they contained indels (see section "__Limitations and notes__"). | ||
|
||
## Examples | ||
|
||
Two small example datasets are included in the `data/` directory. Included are the `FASTQ` files and the `BAM` files generated with __Bismark__ in __Bowtie2__ mode. More details of the example datasets can be found in `data/README.md` | ||
|
||
Although the example datasets are both from directional bisulfite-sequencing protocols, `comethylation` works with data from both directional and non-directional bisulfite-sequencing protocols. | ||
|
||
|
||
### Single-end | ||
|
||
The following command will extract all CpG 2-tuples from the file `data/se_directional.bam`: | ||
|
||
``` | ||
|
@@ -102,12 +109,12 @@ This results in 3 files: | |
* `data/se_directional.fq.gz_bismark_bt2.reads_that_failed_QC.txt` | ||
|
||
### Paired-end | ||
|
||
Paired-end data must firstly be sorted by queryname prior to running `comethylation`. `BAM` files created by Bismark, such as `data/pe_directional.bam`, are already sorted by queryname. So, to extract all CG/CHH 3-tuples we would simply run: | ||
|
||
``` | ||
comethylation -m 3 --methylation-type CG --methylation-type CHH --strand-specific data/pe_directional_1.fq.gz_bismark_bt2_pe.bam | ||
comethylation -m 3 --methylation-type CG --methylation-type CHH data/pe_directional_1.fq.gz_bismark_bt2_pe.bam | ||
``` | ||
Note that we had to specify `--strand-specific` because __CHH__ methylation is not symmetric across the Watson and Crick strands. | ||
|
||
This results in 3 files: | ||
|
||
|
@@ -125,12 +132,13 @@ samtools sort data/pe_directional_1.fq.gz_bismark_bt2_pe.bam data/cs_pe_directio | |
# Re-sort the coordinate-sorted BAM by queryname | ||
SortSam I=data/cs_pe_directional_1.fq.gz_bismark_bt2_pe.bam O=data/qs_pe_directional_1.fq.gz_bismark_bt2_pe SO=queryname | ||
# Run comethylation on the queryname sorted BAM | ||
comethylation -m 3 --methylation-type CG --methylation-type CHG --strand-specific data/qs_pe_directional_1.fq.gz_bismark_bt2_pe.bam | ||
comethylation -m 3 --methylation-type CG --methylation-type CHG data/qs_pe_directional_1.fq.gz_bismark_bt2_pe.bam | ||
``` | ||
|
||
__Note:__ Please use `SortSam` from the `Picard` library rather than `samtools sort`; `SortSam` guarantees that `read_1` will appear before `read_2` in the queryname sorted BAM file whereas `samtools sort` makes no such guarantee. | ||
|
||
## Memory usage and running time | ||
|
||
Memory usage is dependent upon the number of methylation loci on the chromosome (more methylation loci means increased memory usage) and the value of `-m` (roughly, larger values means increased memory usage), but largely independent of the number of reads in the `BAM` file. In contrast, running time is dependent on the number of reads in the `BAM` file and largely independent of the choice of `-m`. | ||
|
||
I will include more detailed performance benchmarks in future releases. For a rough indication of performance, here are the results for processing approximately 41,000,000 100bp paired-end reads from chr1 of a 20-30x coverage whole-genome methylC-seq experiment of human data. This analysis used a single AMD Opteron 6276 CPU (2.3GHz) on a shared memory system. | ||
|
@@ -142,7 +150,8 @@ Memory usage peaked at 2.9GB and the running time was approximately 1.5 hours. | |
Memory usage peaked at 8.8GB and the running time was approximately 1.5 hours. | ||
|
||
## Helper script | ||
I frequently work with large, coordinate-sorted `BAM` files. To speed up the extraction of m-tuples I use a simple (naive) parallelisation strategy. The idea is to split the `BAM` file into chromosome-level `BAM` files, process each chromosome-level `BAM` separately and then recombine these chromosome-level files into a genome-level file. The script `helper_scripts/run_comethylation.sh` implements this strategy; simply edit the key variables in this script. | ||
|
||
I frequently work with large, coordinate-sorted `BAM` files. To speed up the extraction of m-tuples, I use a simple parallelisation strategy with [GNU parallel](http://www.gnu.org/software/parallel/). The idea is to split the `BAM` file into chromosome-level `BAM` files, process each chromosome-level `BAM` separately and then recombine these chromosome-level files into a genome-level file. The script `helper_scripts/run_comethylation.sh` implements this strategy; simply edit the key variables in this script or adapt it to your own needs. | ||
|
||
### Warnings | ||
|
||
|
@@ -173,10 +182,10 @@ Output options: | |
By default, all output files have the same prefix as | ||
that of the input file. This will override the prefix | ||
of output file names | ||
--ss, --strand-specific | ||
Produce strand-specific counts, i.e., don't collapse | ||
methylation calls across Watson and Crick strands. | ||
Required for non-CG methylation type (default: False) | ||
--sc, --strand-collapse | ||
Collapse counts across across Watson and Crick | ||
strands. Only possible for CG methylation type | ||
(default: False) | ||
--nfff, --no-failed-filter-file | ||
Do not create the file listing the reads that failed | ||
to pass to pass the filters and which filter it failed | ||
|
@@ -249,31 +258,37 @@ Other: | |
-v, --version show program's version number and exit | ||
-h, --help show this help message and exit | ||
comethylation (v1.1.0) by Peter Hickey ([email protected], | ||
https://github.com/PeteHaitch/comethylation/ | ||
comethylation (v1.2.0) by Peter Hickey ([email protected], | ||
https://github.com/PeteHaitch/comethylation/) | ||
``` | ||
|
||
|
||
# Limitations and notes | ||
|
||
These are current limitations and their statuses: | ||
|
||
## Only works with data aligned with the __Bismark__ mapping software | ||
|
||
`comethylation` makes use of Bismark's custom SAM tags `XM`, `XR` and `XG`. The `XM` tag is used to infer the methylation state of each sequenced cytosine while the `XR` and `XG` tags are used to infer the orientation and strand of the alignment. If the data were aligned with Bismark version < 0.8.3 please use the `--oldBismark` flag. | ||
|
||
A future version of this software will support the use of `BAM` files created with other popular bisulfite aligners such as [BSMAP](https://code.google.com/p/bsmap/), [BSmooth](https://github.com/BenLangmead/bsmooth-align), [bwa-meth](https://github.com/brentp/bwa-meth/), [gsnap](http://research-pub.gene.com/gmap/), [last](http://last.cbrc.jp/) and [Novoalign](http://www.novocraft.com/). Support will either be provided natively in `comethylation` or via an pre-processing script `bismarkify`. See [Issue #30](https://github.com/PeteHaitch/comethylation/issues/30) | ||
|
||
## Paired-end data must be sorted by queryname | ||
|
||
This is required in order to avoid lookups when finding the mate of a paired-end read. | ||
|
||
The `BAM` file created by Bismark is natively in queryname order so this is not a problem. If the file is not in queryname order then use `samtools sort` with the `-n` option to sort you `BAM` by queryname. The helper script `helper_scripts/run_comethylation.sh` works with a coordinate-sorted `BAM` file and so includes a step to sort the chromosome-level `BAM` files by queryname. | ||
|
||
## `comethylation` will skip any read containing an indel | ||
|
||
It is difficult, although not impossible, to assign coordinates to a cytosine within an indel. To avoid this complication, `comethylation` currently skips any reads containing an indel. I aim to improve the handling of indels in the next release. | ||
|
||
## The `--aligner Bismark_old` option is a bit crude | ||
|
||
Specifically, it assumes that there are no '/' characters in the read names (`QNAME`) and that the BAM has not been processed with any other programs, e.g. Picard's MarkDuplicates, that might change the `FLAG` field. I am happy to improve this functionality if requested. | ||
|
||
## Memory usage can be large | ||
## Memory usage can be large | ||
|
||
For instance, 5-20Gb per chromosome for a typical 20-30x coverage whole-genome methylC-seq experiment of human data. I think this is largely due to inefficiencies in how I store the m-tuples internally in `comethylation` (which is basically as a dictionary of dictionaries). See [Issue #64](https://github.com/PeteHaitch/comethylation/issues/64) | ||
|
||
## Other notes | ||
|
@@ -282,9 +297,11 @@ For instance, 5-20Gb per chromosome for a typical 20-30x coverage whole-genome m | |
* `comethylation` skips paired-end reads where either mate is unmapped. | ||
|
||
# Acknowledgements | ||
|
||
A big thank you to [Felix Krueger](http://www.bioinformatics.babraham.ac.uk/people.html) (the author of Bismark) for his help in understanding mapping of bisulfite-sequencing data and for answering my many questions along the way. | ||
|
||
Thanks also to Tobias Sargeant ([@folded](https://github.com/folded)) for his help in turning the original `comethylation.py` script into the current Python module `comethylation` and for help in setting up a testing framework. | ||
|
||
# Questions and comments | ||
Please use the GitHub Issue Tracker (available at [www.github.com/PeteHaitch/comethylation](www.github.com/PeteHaitch/comethylation)) to file bug reports or request new functionality. I welcome questions and comments; you can email me at [email protected]. | ||
|
||
Please use the GitHub Issue Tracker (available at [www.github.com/PeteHaitch/comethylation](www.github.com/PeteHaitch/comethylation)) to file bug reports or request new functionality. I welcome questions and comments; you can email me at <[email protected]>. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.