From a8db07f91092bf5a6ef61b46afff924bbfccb89a Mon Sep 17 00:00:00 2001
From: Felix Krueger <fkrueger@altoslabs.com>
Date: Fri, 16 Feb 2024 11:04:45 +0000
Subject: [PATCH] Deduplication

---
 docs/options/deduplication.md      | 66 ++++++++++++++++++++++++++++++
 docs/options/genome_preparation.md |  2 +-
 mkdocs.yml                         |  1 +
 3 files changed, 68 insertions(+), 1 deletion(-)
 create mode 100644 docs/options/deduplication.md
diff --git a/docs/options/deduplication.md b/docs/options/deduplication.md
new file mode 100644
index 0000000..e698c25
--- /dev/null
+++ b/docs/options/deduplication.md
@@ -0,0 +1,66 @@
+## Appendix (III): Bismark Deduplication
+
+This script removes alignments to the same position in the genome from the Bismark mapping output (both single-end and paired-end SAM/BAM files), which can arise from, e.g., excessive PCR amplification. Sequences that align to the same genomic position but on different strands are scored individually.
+
+!!! Important:
+
+  Please note that for paired-end BAM files the deduplication script expects Read1 and Read2 to follow each other in consecutive lines! If the file has been sorted by position, make sure that you re-sort it by read name first (e.g. using `samtools sort -n`).
+
+A brief description of the Bismark deduplication and a full list of options can also be viewed by typing `deduplicate_bismark --help`.
+
+#### USAGE: `deduplicate_bismark [options] <filename(s)>`
+
+#### ARGUMENTS:
+
+- `<filenames>`
+
+A space-separated list of Bismark result files in BAM/SAM format.
+
+
+#### OPTIONS:
+
+- `-s/--single-end`
+
+  Deduplicate single-end BAM/SAM Bismark files. Default: [AUTO-DETECT]
+
+- `-p/--paired-end`
+
+  Deduplicate paired-end BAM/SAM Bismark files. Default: [AUTO-DETECT]
+
+- `-o/--outfile [filename]`
+
+The basename of the desired output file. For consistency, this basename is modified to end in `.deduplicated.bam`, or in `.multiple.deduplicated.bam` in `--multiple` mode.
+
+- `--output_dir [path]`
+
+Output directory, either relative or absolute. Output is written to the current directory if not specified explicitly.
+
+- `--barcode`
+
+In addition to chromosome, start position and orientation, this option also takes a potential barcode into consideration while deduplicating. The barcode needs to be the last element of the read ID and separated by a `:`, e.g.: `MISEQ:14:000000000-A55D0:1:1101:18024:2858_1:N:0:CTCCT`
+
+- `--bam`
+
+The output will be written out in BAM format. This script will attempt to use the path to Samtools that was specified with `--samtools_path`, or, if it hasn't been specified, attempt to find Samtools in the `PATH`. If no installation of Samtools can be found, a GZIP compressed output is written out instead (yielding a `.sam.gz` output file). Default: ON.
+
+- `--sam`
+
+The output will be written out in SAM format. Default: OFF.
+
+- `--multiple`
+
+All specified input files are treated as one sample and concatenated together for deduplication. This uses Unix `cat` for SAM files and `samtools cat` for BAM files. Additional notes for BAM files: Although this works on either BAM or CRAM, all input files must be the same format as each other. The sequence dictionary of each input file must be identical, although this command does not check this. By default the header is taken from the first file to be concatenated.
+
+- `--samtools_path [path]`
+
+The path to your Samtools installation, e.g. `/home/user/samtools/`. Does not need to be specified explicitly if Samtools is already in the `PATH`.
+
+- `--version`
+
+Print version information and exit.
+
+
+#### OUTPUT
+
+By default, the output is a BAM file together with a deduplication report (ending in `_deduplication_report.txt`).
+
diff --git a/docs/options/genome_preparation.md b/docs/options/genome_preparation.md
index e1b2657..40b44ae 100644
--- a/docs/options/genome_preparation.md
+++ b/docs/options/genome_preparation.md
@@ -64,7 +64,7 @@ Force generated index to be 'large', even if reference has fewer than 4 billion
 
 This script is supposed to convert a specified reference genome into two different bisulfite converted versions and index them for alignments with Bowtie 2 (default), HISAT2 or minimap2. The first bisulfite genome will have all Cs converted to Ts (C->T), and the other one will have all Gs converted to As (G->A).
 Both bisulfite genomes will be stored in subfolders within the reference genome folder containing the unconverted reference sequence (in FastA format). Once the bisulfite
-conversion has been completed, the program will fork and launch two simultaneous instances of the Bowtie 2, HISAT2 or minimap2 indexer (`bowtie2-build` or `hisat2-build` or `minimap2 -d`, resepctively). Here is the structure of the reference genome folder after successful indexing (with Bowtie2 in this case):
+conversion has been completed, the program will fork and launch two simultaneous instances of the Bowtie 2, HISAT2 or minimap2 indexer (`bowtie2-build` or `hisat2-build` or `minimap2 -d`, respectively). Here is the structure of the reference genome folder after successful indexing (with Bowtie2 in this case):
 
 ```
 ├── Bisulfite_Genome
diff --git a/mkdocs.yml b/mkdocs.yml
index aa6eb36..ee04c60 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -52,6 +52,7 @@ nav:
       - options/genome_preparation.md
       - options/alignment.md
       - options/methylation_extraction.md
+      - options/deduplication.md
   - FAQ:
       - faq/README.md
       - faq/single_cell_pbat.md