Releases: FelixKrueger/Bismark
Version 0.24.2
Just a few fixes, also added two flavours of scripts for merging coverage files (e.g. for when R1 and R2 had been run in single-end mode)
Bismark
- removed an
exit 0
that would terminate runs after processing a single (set of) input file(s).
deduplicate_bismark
- Changed the path to Samtools to custom variable (#609)
coverage2cytosine
- set threshold reads to 1 (if it was 0) for
--gc_context
as intended and mentioned in the help text. Fixes #621
monolithic beast no more
-
Added entirely new documentation website, built using Material for Mkdocs. Thanks to @ewels for a fantastic (late-night) effort to break up and restructure what had become a fairly unwieldy monolithic beast of markdown document...
-
Added docs for cytosine context summary, useful for
GpC
methylation or filtering for specific C context (e.g.CpA
) -
Updated docs for the dovetailing
Bismark
- Warning messages for closing ambiguous and unmapped file handles only occur when these options were specified see here
0.24.0 - long read support with minimap2
Bismark
-
Added new option
--strandID
which reports the alignment strand identity for paired-end, non-directional libraries, e.g.YS:Z:CTOT
. This information may be difficult to obtain if third party tools interfered with the read ordering (admittedly there is a fine balance of read reporting position, FLAG, Read 1 and Genome conversion state to make it work in the first place. More information can be found in this thread). -
runs with
--parallel/--multicore
> 1 specified will now terminate with an error message whenever one of the child processes fails. This prevents potentially incomplete result files making it through to the end unnoticed (more #494) -
runs with
--parallel/--multicore
> 1 as well as--unmapped
and/or--ambiguous
specified will no longer produce potentially corrupt FastQ files (more #495) -
Added option
--mm2/--minimap2
to use minimap2 as the underlying aligner. The minimap2 alignment modes include Oxford Nanopore, PacBio and accurate short reads. In its current implementation, minimap2 can be invoked in one of the following ways: -
--mm2_nanopore
: Sets preset settings for Oxford Nanopore vs reference mapping '-x map-ont' [default] -
--mm2_pacbio
: Sets preset settings for PacBio vs. reference mapping '-x map-pb' -
--mm2_short_reads
: Sets preset settings for accurate short reads '-x sr' -
added option
--mm2_maximum_length <int>
to set a maximum length cutoff, which might be required for very long reads exceeding the maximum number of CIGAR operations tolerated by the BAM formatted reads (>65535). The default is 10,000 bp.
Other options that are currently set within Bismark include '-a' (SAM output), '--MD' (MD tag), '--secondary=no'.
Prompted by fairly slow alignment speeds with the minimap2 default settings, we set out to improve the performance of the alignment process by tweaking several different parameters
Speed optimisiation: optimisation of minimap2 parameters
k-mer size
Due to the reduced DNA alphabet the minimap2 default k-mer size of 15 leads to substantially higher alignment times. Based on our tests we settled for a new default of ‘-k 20’
minibatch size
The minimap2 default minibatch size of 500 million bp means that a substantial amount of data is aligned and held in memory before additional alignment threads can be started. Reducing the minibatch size to 250K reads seemed to be a good compromise (‘-K 250K’).
minimap2 multi-threading
minimap2 alignments may utilize multiple cores for each alignment process; we found that ‘-t 2’ offered a good speed-up, while allowing additional resources had diminishing returns.
Bismark multi-threading
We also tested the potential of using additional resources for Bismark itself (--parallel), which appeared to result in a speed-up of the alignment process as expected; however this comes at the cost of requiring additional system resources.
As a result of these tests, we changed the default settings for minimap2 alignment parameters to ‘-t 2 -k 20 -K 250K’.
methylation_consistency
- Added new option
--chh
to use cytosines in CHH instead of CpG context to enable some trouble shooting and method development
bismark2report
- The CHH/CHG labels for the Cytosine Methylation after Extraction plot now appear in the correct order
bismark_methylation_extractor
-
removed a print statement that would flood STDOUT the logfile if
--merge_non_CG
(but not--comprehensive
) had been selected -
runs with
--parallel/--multicore
specified will now terminate with an error message whenever one of the child processes fails. This prevents potentially incomplete result files making it through to the end unnoticed -
changed the option
-o/--output
to-o/--output_dir
for consistency reasons...
bismark_genome_preparation
- Added option
--mm2/--minimap2
. The genome indexing process (bismark_genome_preparation
) writes out a minimap2 index to the genome folder, using the optimized k-mer size of-k 20
(see comments for bismark itself). This pre-generated minimap2 index takes precedence over indexing options that would otherwise happen as part of the alignment procedure.
deduplicate_bismark
- when using an output filename
-o customname
the deduplication report will also be derived from customname.
Added a sentence to the Docs that Genozip 14 and above supports Bismark BAM files (with a substantial gain in compression).
fix auto-detection
filter_non_conversion
- fixed global setting of
--paired
or--single
mode. Auto-detection now works by only looking at the@PG ID:Bismark
line of the SAM header.
methylation_consistency
- Auto-detection now works by only looking at the
@PG ID:Bismark
line of the SAM header.
coverage2cytosine
- Swapped the columns for count methylated and count unmethylated for the context summary report to match the header line.
v0.23.0
Bismark Release v0.23.0
- Migrated CI tests from Travis to Github Actions
deduplicate_bismark
-
the command
deduplicate_bismark --barcode *bam
now works again. Previously the output file names were accidentally all derived from the first supplied file in--barcode
(= UMI) mode (it had been fixed for normal files in 0.22.2). -
Changed the way the library auto-detection works to only look at the
@PG ID:Bismark
line of the SAM header (to only look for the Bismark command)
bismark_methylation_extractor / bismark2bedGraph
-
Added a new option
--ucsc
tobismark2bedGraph
andbismark_methylation_extractor
that will produce a UCSC-ready bedGraph file if the genome version used came from Ensembl. This option (i) prefixes chromosome names with 'chr', and (ii) changes the mitochondrial chromosome from 'MT' to 'chrM'. In addition, it will also write out a new file ending in.chromosome_sizes.txt
for easier use ofbedGraphToBigWig
. More here. -
Changed the way the library auto-detection works to only look at the
@PG ID:Bismark
line of the SAM header.
coverage2cytosine
-
Added a new output file for all cytosine context methylation totals. More information here: #321.
-
Added new option
--drach/--m6A
. Mostm6A
sites are found in the conserved sequence motifDRACH
(whereD
=G
/A
/U
,R
=G
/A
,H
=A
/U
/C
), and if bound by anti-m6A antibody, it causes the reverse transcriptase to introduceC
toT
transitions at the cytosine which followsA
in theDRACH
motif. This option also sets a coverage threshold of at 1 unless specified explicitly. This is a very specialised option and should only be used by experimentalists looking atm6A
methylation (where the C to T transition acts as a proxy ofm6A
).
bismark2summary
- Samples with absolutely 0 methylation calls in some context are now excluded from the graphical HTML output (as they break rendering the entire summary graph section). These samples and their statistics do still appear in the file
bismark_summary_report.txt
. More information here: #315.
v0.22.3
Bismark
- Accepted pull request to fix the MAPQ score calculation in
local
mode.
methylation_consistency
- Added a new script to assess the concordance of methylation calls. See more here: https://github.com/FelixKrueger/Bismark/tree/master/Docs#x-concordance-of-methylation-calls-across-bisulfite-reads
0.22.2
- Added FAQ document for questions that keep coming up. Will be populated over time.
Bismark
-
the option
--non_bs_mm
is now only allowed in end-to-end mode -
Fixed the calculation of non bisulfite mismatches for paired-end data which happened correctly only when R2 had an InDel (see here)
-
When the option
-u
was used in conjunction with--parallel
, only-u
sequences will be written to the temporary subset files for each spawn of Bismark (previously, the entire file was split for--parallel
, but then only a small subset of those files was used for-u
, which resulted in very long runs even for a small number of analysed sequences)
deduplicate_bismark
- the command
deduplicate_bismark *bam
now works again. Previously the output file names were accidentally all derived from the first supplied file.
coverage2cytosine
- Added new option
--coverage_threshold INT
. Positions have to be covered by at least INT calls (irrespective of their methylation state) before they get reported. For NOMe-seq, the minimum threshold is automatically set to 1 unless specified explicitly. Setting a coverage threshold does not work in conjunction with--merge_CpGs
(as all genomix CpGs are required for this). Default: 0 (i.e. all genomic positions get reported)
bismark2report
- added seconds to the timestamp report statement (which caused a warning on certain, but not all, platforms)
bismark2summary
- Now reads splitting reports even for non-deduplicated files (such as RRBS).
Essential Easter Performance Release [EEPR]
Bismark
-
Hot-fixed (read: removed) the cause of delay during the
MD:Z:
field computation for reads containing a deletion (which was roughly equal to 1 second per read). Apologies, I did it again... -
Changed the default
--score_min
function for HISAT2 in--local
mode back to a linear function (instead of using the logarithmic model that is employed by Bowtie 2). The default is now--score_min L,0,-0.2
for both end-to-end (default) and--local
mode. It should be mentioned that we currently don't understand how exactly the scoring mode in HISAT2 works (even though the scores appear to be all negative with a maximum value of 0), so this might change somewhat in the future. See here for more info.
Easter Release - local alignments for single-cell and scNMT-seq
Expanding on our observation that single-cell BS-seq, or PBAT libraries in general, can generate chimeric read pairs, a recent publication by Wu et al. described in further detail that intra-fragment chimeras can hinder the efficient alignment of single-cell BS-seq libraries. In there, the authors described a pipeline that uses paired-end alignments first, followed by a second, single-end alignment step that uses local alignments in a bid to improve the mapping of intra-molecular chimeras. To allow this type of improvement for single-cell or PBAT libraries, we have been experimenting with allowing local alignments.
Please note that we still do not recommend using local alignments as a means to magically increase mapping efficiencies (please see here), but we do acknowledge that PBAT/scBSs-seq/scNMT-seq are exceptional applications where local alignments might indeed make a difference (there is only so much data to be had from a single cell...).
We didn't have the time yet to set more appropriate or stringent default values for local alignments (suggestions welcome), nor did we investigate whether the methylation extraction will require an additional --ignore
flag if a read was found to the be soft-clipped (the so called 'micro-homology domains'). This might be added in the near future.
Bismark
-
Added support for local alignments by introducing the new option
--local
. This means that the CIGAR operationS
(soft-clipping) is now supported -
fixed typo in option
--path_to_bowtie2
(a single missing2
was preventing the specified path to be accepted) -
fixed typo in option
--no-spliced-alignment
in HISAT2 mode -
fixed missing end-of-line character for unmapped or ambiguous FastQ sequences in paired-end FastQ mode
-
fixed output file naming in
--hisat2
and--parallel
mode (_hisat2 was missing in--parallel
mode). Thanks to @phue for spotting this.
bismark_genome_preparation
- Added option
--large-index
to force the generation of LARGE genome indexes. This may be required for indexing extremely large genomes (e.g. the Axolotl (32 GigaBases)) in--parallel
mode. For more information on why the indexing was failing previously see here
bismark_methylation_extractor
- Now supporting reads containing soft-clipped bases (CIGAR operation S)
bam2nuc
- Now supporting reads containing soft-clipped bases (CIGAR operation S)
deduplicate_bismark
- Now supporting reads containing soft-clipped bases (CIGAR operation S)
[New]: HISAT2 and SLAM-mode; [Retired]: Bowtie 1
For the upcoming version Bismark has undergone some substantial changes, which sometimes affect more than one module within the Bismark suite. Here is a short description of the major changes:
[Retired]: Bowtie 1 support
Bowtie (1)
support, and all of its options, has been completely dropped frombismark_genome_preparation
andbismark
. This decision was not made lightly, but it seems no one is using the original Bowtie short read aligner anymore, even short reads have moved on...- Consequently, the option
--vanilla
and its handling has been removed from a number of modules (bismark_genome_preparation
,bismark
,bismark_methylation_extractor
anddeduplicate_bismark
). Too bad, I liked that name...
[Added]: HISAT2 support
-
Instead, the DNA and RNA aligner HISAT2 has been added as a new choice of aligner. The reason for this is not necessarily that RNA methylation is now a thing, but certain alignment modes (see below) do require splice-aware mapping if we don't want to miss out on a whole class of (spliced) alignments. Bowtie 2 is the default mode, HISAT2 alignments can be enabled with the option
--hisat2
-
Similar to the Bowtie2 mode, alignments with HISAT2 are restricted to global (end-to-end) alignments, i.e. soft-clipping is disabled. Furthermore, in paired-end mode, the options
--no-mixed
and--no-discordant
are permanently enabled, meaning that only properly aligned read pairs are put out. -
As the
--hisat2
mode supports spliced alignments, the newCIGAR
operationN
is now supported in all Bismark modules (this includesbismark_genome_preparation
,bismark
,bismark_methylation_extractor
,deduplicate_bismark
and some others).
At the time of writing this, the --hisat2
mode appears to be working as expected. It should be mentioned however that we have not done a lot of testing of these new files, so comments and feedback are welcome.
SLAM-seq mode
We also added a new, experimental and completely different type of alignment for SLAM-seq type data (option --slam
). This fairly recent method to interrogate newly synthesized messenger RNA is akin to bisulfite conversion, in that newly synthesized RNA may contain T to C conversions following an alkylation reaction (original publication and https://www.nature.com/articles/nmeth.4435). The new Bismark alignment mode --slam
performs T>C conversions of both the genome (in the genome preparation step) and the subsequent alignment steps (Bismark alignment step). Currently, the rest of the processing of SLAM-seq data hijacks the standard methylation pipeline:
- T>C conversions are written out as
methylation events
in CpG context, while T-T matches are scored asunmethylated events
in CpG context. Other cytosine contexts are not being used.
So in a nut-shell: methylation calls in --slam
mode are either Ts (unmethylated calls = matches at T positions), or T to C mismatches (methylated calls = C mismatches at T positions).
It should be noted that this is currently an experimental workflow. One might argue that T/C conversion aware (or T/C mis-mapping agnostic) mapping is currently not necessary for SLAM-seq, NASC-Seq, or scSLAM-seq data as the labeling reaction is very inefficient (1 in only 50 to 200 newly incorporated Ts is a 4sU, which may get alkylated). This might be true - for now. If and when the conversion reaction improves over time, C/T agnostic mapping, similar to bisulfite-Seq data, might very well become necessary.
Here is a screenshot of a comparison of aligning the same data (SLAM-seq-like) with Bismark in Bowtie 2 mode (top track) and HISAT2 mode (middle track). Alignments with HISAT2 recover a lot more alignments to short exons, as well as exon-exon spanning reads (evidenced in bottom track):
- Added documentation for NOMe-seq or scNMT-seq processing.
bismark
-
Dropped support for Bowtie
-
Removed all traces of
--vanilla
-
Added support for HISAT2 with option
--hisat2
. -
Added HISAT2 option
--no-spliced-aligments
to disable spliced alignments altogether -
Added HISAT2 option
--known-splicesite-infile <path>
to provide a list of known splice sites. -
Added option
--slam
to allow T/C mismatch agnostic mapping (3-letter alignment). More here. -
Added a new option
--icpc
to truncate read IDs at the first space (or tab) it encounters in the (FastQ) read ID, which are sometimes used to add comments to a FastQ entry (instead of replacing them with underscores which is the default behaviour).
bismark_genome_preparation
-
Dropped support for Bowtie
-
Added support for HISAT2 with option
--hisat2
. -
Added option
--slam
. Instead of performing an in-silico bisulfite conversion, this mode transforms T to C (forward strand), or A to G (reverse strand). The folder structure and rest of the indexing process is currently exactly the same as for bisulfite sequences, but this might change at some point. This means that a genome prepared in--slam
mode is currently indistinguishable from a true Bisulfite Genome (until the alignments are in) so please make sure you name the genome folder appropriately to avoid confusion.
deduplicate_bismark
-
Removed all traces of
--vanilla
-
--bam
mode is now the default. Uncompressed SAM output may still be obtained using the new option--sam
-
Added new option
-o/--outfile <basename>
. This basename is then modified to remove file endings such as.bam
,.sam
,.txt
or.gz
, and.deduplicated.bam
, or.multiple.deduplicated.bam
in--multiple
mode, is then appended for consistency reasons.
- Added support for new CIGAR operation
N
bismark_methylation_extractor
-
Added support for new CIGAR operation
N
for all extraction modes -
Removed all traces of
--vanilla
bismark2summary/bismark2report
- Adapted to work with Bismark HISAT2 reports instead of Bowtie 1 reports.
bam2nuc
- Reads containing spliced reads are now also skipped when determining the genomic base composition (as are reads with InDels).