Skip to content

Commit

Permalink
Merge pull request #81 from PeteHaitch/add_strand
Browse files Browse the repository at this point in the history
Strand information is now recorded in output as either '+' (forward strand), '-' (reverse strand) or '*' (not applicable). Removed --strand-specific parameter and replaced with --strand-collapse, which effectively has the opposite meaning.
  • Loading branch information
PeteHaitch committed Jun 20, 2014
2 parents 9ab3b53 + f2d8f45 commit 72ce25b
Show file tree
Hide file tree
Showing 6 changed files with 209 additions and 157 deletions.
55 changes: 36 additions & 19 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
`comethylation` is a methylation caller for methylation events that co-occur on the same DNA fragment from high-throughput bisulfite sequencing data, such as `methylC-seq`. A typical read from such an experiment reports a binary methylated or unmethylated measurement at multiple loci, where each read originates from a single cell. `comethylation` allows us to investigate the co-occurence of methylation marks at the level of a single cell.

# Installation and dependencies

Simply running

```
Expand All @@ -25,6 +26,7 @@ A 2-tuple (m = 2) is a pair of methylation loci; `comethylation` tabulates the n
For a chosen value of _m_, all _m-tuples_ are extracted and tabulated across the genome. The key point is that _m-tuples_ are only constructed if all _m_ loci are within a single read, so we can investigate the co-occurence of methylation events at the level of a single cell.

## Basic usage

`comethylation` processes a single `BAM` file and works for both single-end and paired-end sequencing data. Example `BAM` files from single-end directional and paired-end directional bisulfite-sequencing experiments are available in the `data/` directory.

Methylation measurements may be filtered by base quality or other criteria such as the mapping quality of the read or whether the read is marked as a PCR duplicate. For a full list of filtering options, please run `comethylation --help` or see the __Advanced Usage__ section below.
Expand All @@ -34,18 +36,21 @@ Currently, the BAM file must have been created with [Bismark](http://www.bioinfo
The main options to pass `comethylation` are the size of the m-tuple (`-m`); the type of methylation, which is some combination of _CG_, _CHG_, _CHH_ and _CNN_ (`--methylation-type`); any filters to be applied to reads or positions within reads (see below); the BAM file; and the sample name, which will be used as a prefix for all output files. Multiple methylation types may be specified jointly, e.g., `--methylation-type CG --methylation-type CHG`

## Output

Three output files are created and summary information is written to `STDOUT`. The main output file is a tab-delimited file of all m-tuples (`<in>.<--methylation-type>.<-m>.tsv`), where `<in>` is the prefix of the `<in.bam>` BAM file.

Here are the 5 rows (including with the header row) from `data/se_directional.fq.gz_bismark_bt2.CG.2.tsv`, which is created by running the single-end directional example shown below:
Here are the first 5 rows (including with the header row) from `data/se_directional.fq.gz_bismark_bt2.CG.2.tsv`, which is created by running the single-end directional example shown below:

```
chr pos1 pos2 MM MU UM UU
chr1 3079983 3079993 0 0 1 0
chr1 6387768 6387783 1 0 0 0
chr1 6790760 6790796 0 0 0 1
chr1 6790796 6790841 0 0 1 0
chr strand pos1 pos2 MM MU UM UU
chr1 + 6387768 6387783 1 0 0 0
chr1 + 7104116 7104139 1 0 0 0
chr1 + 7104139 7104152 1 0 0 0
chr1 + 9256170 9256179 0 0 0 1
```
So, for example, at the CpG 2-tuple chr1:(3,079,983, 3,079,993) we observed 1 read that was unmethylated at chr1:3,079,983 and methylated at chr1:3,079,993.
So, for example, at the CpG 2-tuple chr1:+:(6,387,768, 6,387,783) we observed 1 read that was methylated at chr1:+:6,387,768 and methylated at chr1:+:6,387,783.

The `strand` is recorded as `+` (forward strand, "OT" in Bismark), `-` (reverse strand, "OB" in Bismark) or `*`, meaning not applicable, if the `--strand-collapse` option is set. The position of all methylation loci is always with respect to the forward strand.

The second file (`<in>.<--methylation-type>_per_read.hist`) is a text histogram of the number of methylation loci per read or readpair (of the type specified by `--methylation-type`) that passed the filters specified at runtime of `comethylation`.

Expand Down Expand Up @@ -83,12 +88,14 @@ SRR921747.81_JONAS:2105:C0G2AACXX:4:2205:18836:56196_length=101 read has indel
So, both of these reads were filtered out because they contained indels (see section "__Limitations and notes__").

## Examples

Two small example datasets are included in the `data/` directory. Included are the `FASTQ` files and the `BAM` files generated with __Bismark__ in __Bowtie2__ mode. More details of the example datasets can be found in `data/README.md`

Although the example datasets are both from directional bisulfite-sequencing protocols, `comethylation` works with data from both directional and non-directional bisulfite-sequencing protocols.


### Single-end

The following command will extract all CpG 2-tuples from the file `data/se_directional.bam`:

```
Expand All @@ -102,12 +109,12 @@ This results in 3 files:
* `data/se_directional.fq.gz_bismark_bt2.reads_that_failed_QC.txt`

### Paired-end

Paired-end data must firstly be sorted by queryname prior to running `comethylation`. `BAM` files created by Bismark, such as `data/pe_directional.bam`, are already sorted by queryname. So, to extract all CG/CHH 3-tuples we would simply run:

```
comethylation -m 3 --methylation-type CG --methylation-type CHH --strand-specific data/pe_directional_1.fq.gz_bismark_bt2_pe.bam
comethylation -m 3 --methylation-type CG --methylation-type CHH data/pe_directional_1.fq.gz_bismark_bt2_pe.bam
```
Note that we had to specify `--strand-specific` because __CHH__ methylation is not symmetric across the Watson and Crick strands.

This results in 3 files:

Expand All @@ -125,12 +132,13 @@ samtools sort data/pe_directional_1.fq.gz_bismark_bt2_pe.bam data/cs_pe_directio
# Re-sort the coordinate-sorted BAM by queryname
SortSam I=data/cs_pe_directional_1.fq.gz_bismark_bt2_pe.bam O=data/qs_pe_directional_1.fq.gz_bismark_bt2_pe SO=queryname
# Run comethylation on the queryname sorted BAM
comethylation -m 3 --methylation-type CG --methylation-type CHG --strand-specific data/qs_pe_directional_1.fq.gz_bismark_bt2_pe.bam
comethylation -m 3 --methylation-type CG --methylation-type CHG data/qs_pe_directional_1.fq.gz_bismark_bt2_pe.bam
```

__Note:__ Please use `SortSam` from the `Picard` library rather than `samtools sort`; `SortSam` guarantees that `read_1` will appear before `read_2` in the queryname sorted BAM file whereas `samtools sort` makes no such guarantee.

## Memory usage and running time

Memory usage is dependent upon the number of methylation loci on the chromosome (more methylation loci means increased memory usage) and the value of `-m` (roughly, larger values means increased memory usage), but largely independent of the number of reads in the `BAM` file. In contrast, running time is dependent on the number of reads in the `BAM` file and largely independent of the choice of `-m`.

I will include more detailed performance benchmarks in future releases. For a rough indication of performance, here are the results for processing approximately 41,000,000 100bp paired-end reads from chr1 of a 20-30x coverage whole-genome methylC-seq experiment of human data. This analysis used a single AMD Opteron 6276 CPU (2.3GHz) on a shared memory system.
Expand All @@ -142,7 +150,8 @@ Memory usage peaked at 2.9GB and the running time was approximately 1.5 hours.
Memory usage peaked at 8.8GB and the running time was approximately 1.5 hours.

## Helper script
I frequently work with large, coordinate-sorted `BAM` files. To speed up the extraction of m-tuples I use a simple (naive) parallelisation strategy. The idea is to split the `BAM` file into chromosome-level `BAM` files, process each chromosome-level `BAM` separately and then recombine these chromosome-level files into a genome-level file. The script `helper_scripts/run_comethylation.sh` implements this strategy; simply edit the key variables in this script.

I frequently work with large, coordinate-sorted `BAM` files. To speed up the extraction of m-tuples, I use a simple parallelisation strategy with [GNU parallel](http://www.gnu.org/software/parallel/). The idea is to split the `BAM` file into chromosome-level `BAM` files, process each chromosome-level `BAM` separately and then recombine these chromosome-level files into a genome-level file. The script `helper_scripts/run_comethylation.sh` implements this strategy; simply edit the key variables in this script or adapt it to your own needs.

### Warnings

Expand Down Expand Up @@ -173,10 +182,10 @@ Output options:
By default, all output files have the same prefix as
that of the input file. This will override the prefix
of output file names
--ss, --strand-specific
Produce strand-specific counts, i.e., don't collapse
methylation calls across Watson and Crick strands.
Required for non-CG methylation type (default: False)
--sc, --strand-collapse
Collapse counts across across Watson and Crick
strands. Only possible for CG methylation type
(default: False)
--nfff, --no-failed-filter-file
Do not create the file listing the reads that failed
to pass to pass the filters and which filter it failed
Expand Down Expand Up @@ -249,31 +258,37 @@ Other:
-v, --version show program's version number and exit
-h, --help show this help message and exit
comethylation (v1.1.0) by Peter Hickey ([email protected],
https://github.com/PeteHaitch/comethylation/
comethylation (v1.2.0) by Peter Hickey ([email protected],
https://github.com/PeteHaitch/comethylation/)
```


# Limitations and notes

These are current limitations and their statuses:

## Only works with data aligned with the __Bismark__ mapping software

`comethylation` makes use of Bismark's custom SAM tags `XM`, `XR` and `XG`. The `XM` tag is used to infer the methylation state of each sequenced cytosine while the `XR` and `XG` tags are used to infer the orientation and strand of the alignment. If the data were aligned with Bismark version < 0.8.3 please use the `--oldBismark` flag.

A future version of this software will support the use of `BAM` files created with other popular bisulfite aligners such as [BSMAP](https://code.google.com/p/bsmap/), [BSmooth](https://github.com/BenLangmead/bsmooth-align), [bwa-meth](https://github.com/brentp/bwa-meth/), [gsnap](http://research-pub.gene.com/gmap/), [last](http://last.cbrc.jp/) and [Novoalign](http://www.novocraft.com/). Support will either be provided natively in `comethylation` or via an pre-processing script `bismarkify`. See [Issue #30](https://github.com/PeteHaitch/comethylation/issues/30)

## Paired-end data must be sorted by queryname

This is required in order to avoid lookups when finding the mate of a paired-end read.

The `BAM` file created by Bismark is natively in queryname order so this is not a problem. If the file is not in queryname order then use `samtools sort` with the `-n` option to sort you `BAM` by queryname. The helper script `helper_scripts/run_comethylation.sh` works with a coordinate-sorted `BAM` file and so includes a step to sort the chromosome-level `BAM` files by queryname.

## `comethylation` will skip any read containing an indel

It is difficult, although not impossible, to assign coordinates to a cytosine within an indel. To avoid this complication, `comethylation` currently skips any reads containing an indel. I aim to improve the handling of indels in the next release.

## The `--aligner Bismark_old` option is a bit crude

Specifically, it assumes that there are no '/' characters in the read names (`QNAME`) and that the BAM has not been processed with any other programs, e.g. Picard's MarkDuplicates, that might change the `FLAG` field. I am happy to improve this functionality if requested.

## Memory usage can be large
## Memory usage can be large

For instance, 5-20Gb per chromosome for a typical 20-30x coverage whole-genome methylC-seq experiment of human data. I think this is largely due to inefficiencies in how I store the m-tuples internally in `comethylation` (which is basically as a dictionary of dictionaries). See [Issue #64](https://github.com/PeteHaitch/comethylation/issues/64)

## Other notes
Expand All @@ -282,9 +297,11 @@ For instance, 5-20Gb per chromosome for a typical 20-30x coverage whole-genome m
* `comethylation` skips paired-end reads where either mate is unmapped.

# Acknowledgements

A big thank you to [Felix Krueger](http://www.bioinformatics.babraham.ac.uk/people.html) (the author of Bismark) for his help in understanding mapping of bisulfite-sequencing data and for answering my many questions along the way.

Thanks also to Tobias Sargeant ([@folded](https://github.com/folded)) for his help in turning the original `comethylation.py` script into the current Python module `comethylation` and for help in setting up a testing framework.

# Questions and comments
Please use the GitHub Issue Tracker (available at [www.github.com/PeteHaitch/comethylation](www.github.com/PeteHaitch/comethylation)) to file bug reports or request new functionality. I welcome questions and comments; you can email me at [email protected].

Please use the GitHub Issue Tracker (available at [www.github.com/PeteHaitch/comethylation](www.github.com/PeteHaitch/comethylation)) to file bug reports or request new functionality. I welcome questions and comments; you can email me at <[email protected]>.
2 changes: 1 addition & 1 deletion comethylation/__init__.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
__version_info__ = ('1','1','0')
__version_info__ = ('1','2','0')
__version__ = '.'.join(__version_info__)

from . import funcs
Expand Down
Loading

0 comments on commit 72ce25b

Please sign in to comment.