release v1.0 freeze

Zhong-Lab-UCSD · Feb 27, 2019 · a5807e7
1 parent 306cfdd
commit a5807e7
Showing 56 changed files with 1,058 additions and 937 deletions.
diff --git a/README.md b/README.md
@@ -15,6 +15,9 @@ iMARGI-Docker distributes the iMARGI sequencing data processing pipeline
       - [Build with Dockerfile](#build-with-dockerfile)
   - [Software Testing Demo](#software-testing-demo)
     - [Testing Data](#testing-data)
+      - [iMARGI sequencing data (paired FASTQ)](#imargi-sequencing-data-paired-fastq)
+      - [Reference genome data (FASTA)](#reference-genome-data-fasta)
+      - [bwa index data](#bwa-index-data)
     - [Testing Command](#testing-command)
     - [Testing Results](#testing-results)
       - [Running Time Profile](#running-time-profile)
@@ -106,19 +109,37 @@ To test whether you have successfully installed iMARGI-Docker, you can follow in
 
 ### Testing Data
 
+#### iMARGI sequencing data (paired FASTQ)
+
 Because real iMARGI sequencing data sets are very large, we randomly extracted a small chunk of real data for software
 testing. The data are in the [`data`](./data/) folder. Please download them to your computer.
 
-Besides, you need to download a human genome reference FASTA file. 
+- [R1 reads](https://github.com/Zhong-Lab-UCSD/iMARGI-Docker/raw/master/data/sample_R1.fastq.gz)
+- [R2 reads](https://github.com/Zhong-Lab-UCSD/iMARGI-Docker/raw/master/data/sample_R2.fastq.gz)
+
+#### Reference genome data (FASTA)
+
+Besides, you need to download a human genome reference FASTA file.
 We use the reference genome used by
 [4D Nucleome](https://www.4dnucleome.org/) and
-[ENCODE project](https://www.encodeproject.org/data-standards/reference-sequences/). The FASTA file of the reference
+[ENCODE project](https://www.encodeproject.org/data-standards/reference-sequences/).
+
+The FASTA file of the reference
 genome is too large for us to host in the GitHub repo. You can download it using this link:
-[GRCh38_no_alt_analysis_set_GCA_000001405.15](https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz).
+
+- [GRCh38_no_alt_analysis_set_GCA_000001405.15](https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz)
+
 Decompress it with the `gunzip -d` command on Linux/MacOS. On Windows, you can use 7-Zip
 to decompress the `.gz` file. Alternatively, you can use the `gunzip` tool delivered in iMARGI-Docker.
 
-We assume that you put the data and reference files in the following directory structure.
+#### bwa index data
+
+Because the `bwa index` step takes a long time (more than 1 hour), we suggest downloading our pre-built index files for the reference
+genome. Please download the following gzip-compressed `bwa_index` folder and extract it (`tar zxvf`) on your machine.
+
+- [bwa index files](https://sysbio.ucsd.edu/imargi_pipeline/bwa_index.tar.gz)
+
+*We assume that you put the data and reference files in the following directory structure.*
 
 ``` bash
 ~/imargi_example
@@ -127,7 +148,13 @@ We assume that you put the data and reference files in the following directory s
     │   └── sample_R2.fastq.gz
     ├── output
     └── ref
-        └── GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
+        ├── GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
+        └── bwa_index
+            ├── bwa_index_hg38.amb
+            ├── bwa_index_hg38.ann
+            ├── bwa_index_hg38.bwt
+            ├── bwa_index_hg38.pac
+            └── bwa_index_hg38.sa
 ```
 
 ### Testing Command
@@ -141,6 +168,7 @@ docker run -u 1043 -v ~/imargi_example:/imargi zhonglab/imargi \
     -N test_sample \
     -t 4 \
     -g ./ref/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta \
+    -i ./ref/bwa_index/bwa_index_hg38 \
     -1 ./data/sample_R1.fastq.gz \
     -2 ./data/sample_R2.fastq.gz \
     -o ./output
@@ -160,23 +188,30 @@ docker run -u 1043 -v ~/imargi_example:/imargi zhonglab/imargi \
 - The command line is long, so `\` was used for splitting it into multiple lines in the example. It's a Linux or MacOS
   style. However, in Windows, you need to replace `\` with `^`.
 
-- Building bwa index costs the most running time. If you have human genome bwa index built before, you can supply it
-  with `-i` argument. See more details in the
+- `-i`: Building the bwa index takes a long time, so we used the pre-built index files with the `-i` argument.
+  Other arguments accept pre-generated files too, such as `-R` for the restriction fragment BED file and
+  `-c` for the chromsize file. See more details in the
   [documentation of command line API section](https://sysbio.ucsd.edu/imargi_pipeline/commandline_api.html#imargi-wrapper-sh)
-`
+
+- `-i`: If you don't supply bwa index files, `imargi_wrapper.sh` will generate them automatically. This works
+  well on Linux. However, it doesn't work on Windows and MacOS because `bwa index` uses `fsync` when building
+  a large genome index, which cannot handle different drive formats (`-v` mounts a Windows/MacOS drive into the Linux container).
+  So it's better to build the index in advance. There is a solution to the problem if you are familiar with Docker
+  volumes. Please read the
+  [technical note of iMARGI pipeline documentation](https://sysbio.ucsd.edu/imargi_pipeline/technical_note.html#solve-bwa-index-failure-problem) for
+  details.
 
 ### Testing Results
 
 #### Running Time Profile
 
-It took about 85 minutes to perform the pipeline. The most of time (75 min) was consumed by building bwa index files.
-So once you built the bwa index, supply it to the command with `-i` next time.
+It took about 15 minutes to run the pipeline (with the `-i` bwa index argument).
 
 Step | Time | Speed up suggestion
 ---------|----------|----------
-Generating chromosome size file | 10 sec | It's fast, but you can supply with `-c` once you've generated it.
-Generating bwa index | 75 min | Supply with `-i` once you've built it.
-Generating restriction fragment file | 4 min | Supply with `-R` once you've created it.
+Generating chromosome size file | 10 sec | It's fast, but you can also supply it with `-c` if you've generated it before.
+Generating bwa index (skipped) | 75 min | Supply with `-i` if you have pre-built index files.
+Generating restriction fragment file | 4 min | Supply with `-R` if you've already created it.
 cleaning | 10 sec | It's fast and not parallelized.
 bwa mapping | 2 min | More CPU cores with `-t`.
 interaction pair parsing | 1 min | More CPU cores with `-t`.

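The directory layout assumed by the README hunk above can be checked before running the test command. The following is a minimal sketch, not part of iMARGI-Docker: `check_layout` is a hypothetical helper name, and the sketch assumes that all five bwa index files share the `bwa_index_hg38` prefix that is passed to `-i`.

```bash
#!/usr/bin/env bash
# Sketch: verify the example directory layout before running the pipeline.
# check_layout is a hypothetical helper, not part of iMARGI-Docker.

check_layout() {
    local root="${1:-$HOME/imargi_example}"
    local ref="GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta"
    local missing=0
    local f

    # Input reads, reference FASTA, and the five bwa index files
    # (assumed to share the prefix given to -i).
    for f in \
        "data/sample_R1.fastq.gz" \
        "data/sample_R2.fastq.gz" \
        "ref/$ref" \
        "ref/bwa_index/bwa_index_hg38.amb" \
        "ref/bwa_index/bwa_index_hg38.ann" \
        "ref/bwa_index/bwa_index_hg38.bwt" \
        "ref/bwa_index/bwa_index_hg38.pac" \
        "ref/bwa_index/bwa_index_hg38.sa"
    do
        if [ ! -f "$root/$f" ]; then
            echo "missing: $root/$f" >&2
            missing=1
        fi
    done

    # The output directory only needs to exist.
    if [ ! -d "$root/output" ]; then
        echo "missing directory: $root/output" >&2
        missing=1
    fi

    return "$missing"
}
```

Calling `check_layout` with no argument checks `~/imargi_example`; it returns non-zero and lists each missing path on stderr if the layout is incomplete.
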
diff --git a/docs/build/doctrees/commandline_api.doctree b/docs/build/doctrees/commandline_api.doctree
diff --git a/docs/build/doctrees/environment.pickle b/docs/build/doctrees/environment.pickle
diff --git a/docs/build/doctrees/further_analysis.doctree b/docs/build/doctrees/further_analysis.doctree
diff --git a/docs/build/doctrees/index.doctree b/docs/build/doctrees/index.doctree
diff --git a/docs/build/doctrees/installation.doctree b/docs/build/doctrees/installation.doctree
diff --git a/docs/build/doctrees/origin_imargi_methods.doctree b/docs/build/doctrees/origin_imargi_methods.doctree
diff --git a/docs/build/doctrees/performance profile.doctree b/docs/build/doctrees/performance profile.doctree
diff --git a/docs/build/doctrees/quick_example.doctree b/docs/build/doctrees/quick_example.doctree
diff --git a/docs/build/doctrees/step_by_step_illustration.doctree b/docs/build/doctrees/step_by_step_illustration.doctree
diff --git a/docs/build/doctrees/technical_note.doctree b/docs/build/doctrees/technical_note.doctree
diff --git a/docs/build/html/.buildinfo b/docs/build/html/.buildinfo
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: 893b56350ce42023ad6ecd0594af280a
+config: 25a1ecdc13a051456e7e1298a01d2ebe
 tags: 645f666f9bcd5a90fca523b33c5a78b7
diff --git a/docs/build/html/_images/docker_command_example.png b/docs/build/html/_images/docker_command_example.png
diff --git a/docs/build/html/_sources/commandline_api.md.txt b/docs/build/html/_sources/commandline_api.md.txt
@@ -28,8 +28,10 @@ We created several script tools. Here we show the usage and source code of all t
     -O : Max offset bases for filtering pairs based on R2 5' end positions to restriction sites. Default 3.
     -M : Max size of ligation fragment for sequencing. It's used for filtering unligated DNA sequence.
     -t : Max CPU threads for parallelized processing, at least 4. (Default 8)
-    -1 : R1 fastq.gz file, if there are multiple files, just separated with space, such as -1 lane1_R1.fq lane2_R1.fq
-    -2 : R2 fastq.gz file, if there are multiple files, just separated with space, such as -1 lane1_R2.fq lane2_R2.fq
+    -1 : R1 fastq.gz file, if there are multiple files, just separated with space or use wildcard,
+         such as '-1 lane1_R1.fq.gz lane2_R1.fq.gz', or '-1  lane*_R1.fq.gz'.
+    -2 : R2 fastq.gz file, if there are multiple files, just separated with space or use wildcard,
+         such as '-2 lane1_R2.fq.gz lane2_R2.fq.gz', or '-2  lane*_R2.fq.gz'.
     -o : Output directory
     -h : Show usage help
 ```
@@ -39,30 +41,25 @@ We created several script tools. Here we show the usage and source code of all t
 [*Source Code*](https://github.com/Zhong-Lab-UCSD/iMARGI-Docker/blob/master/src/imargi_clean.sh)
 
 ``` bash
-    Usage: imargi_clean.sh [-1 <fastq.gz_R1>] [-2 <fastq.gz_R2>] [-o <output_dir>] [-f <filter_CT>] [-d <drop>] 
-                [-t <threads>] [-b <block_size>]
+    Usage: $PROGNAME [-1 <fastq.gz_R1>] [-2 <fastq.gz_R2>] [-N <base_name>] [-o <output_dir>] [-t <threads>]
 
     Dependencies: seqtk, gzip, zcat, awk, parallel
 
     This script will clean the paired reads (R1 and R2) of iMARGI sequencing Fastq data. According to the iMARGI design,
-    RNA end reads (R1) start with 2 random based, and DNA end reads (R2) of successful ligation fragments start
-    with "CT". We need to remove the first 2 bases of R1 for better mapping. For R2, We can strictly clean the data by
-    filtering out those R2 reads not starting with "CT" in this step by setting "-f CT". Alternatively, 
-    you can skip the "CT" filtering without "-f" parameter. You can apply the "CT" filtering in interaction pairs
-    filtering step.
-    If you choose to do "CT" filtering, the script also fixes the paired reads in R1. If "-d" was set as "true",
-    it will drop all the non "CT" started R2 reads and paired R1 reads, which outputs two fastq files with prefix
-    "clean_". If "-d" was "false", the filtered read pairs would also be outputed in a pair of fastq files with prefix
-    "drop_". "-d" only works when "-f" is set, and the default setting of "-d" is "false".
+    RNA end reads (R1) start with 2 random bases. We need to remove the first 2 bases of R1 for better mapping.
+    If you provide multiple input files (different lanes) in '-1' and '-2', with ',' separator or a wildcard,
+    then the multi-lane fastq files will be merged into one clean fastq file.
+
     The input fastq files must be gzip files, i.e., fastq.gz or fq.gz. The output files are also gzipped files fastq.gz.
 
-    -1 : R1 fastq.gz file, if there are multiple files, just separated with space, such as -1 lane1_R1.fq lane2_R1.fq
-    -2 : R2 fastq.gz file, if there are multiple files, just separated with space, such as -1 lane1_R2.fq lane2_R2.fq
+    -1 : R1 fastq.gz file, if there are multiple files, just separated with space or use wildcard,
+         such as '-1 lane1_R1.fq.gz lane2_R1.fq.gz', or '-1  lane*_R1.fq.gz'.
+    -2 : R2 fastq.gz file, if there are multiple files, just separated with space or use wildcard,
+         such as '-2 lane1_R2.fq.gz lane2_R2.fq.gz', or '-2  lane*_R2.fq.gz'.
+    -N : Base name for output result. For example, with -N HEK_iMARGI the cleaned and merged fastq.gz output file
+         will be named using the base name.
     -o : Output directory
-    -f : Filtering sequence by 5' start of R2. If not set, no filtering applied. "CT" filtering can be set as "-f CT"
-    -d : Flag of dropping, working with "-f". Default is false, i.e., output drop_*fastq.gz files of dropped read pairs.
     -t : Max CPU threads for parallelized processing, at least 4. (Default 8)
-    -b : Fastq data block size (number of reads) for each thread. Default 2000000.
     -h : Show usage help
 ```
 
@@ -71,14 +68,10 @@ We created several script tools. Here we show the usage and source code of all t
 [*Source Code*](https://github.com/Zhong-Lab-UCSD/iMARGI-Docker/blob/master/src/imargi_rsfrags.sh)
 
 ``` bash
-    Usage: imargi_rsfrags.sh [-r <ref_fasta>] [-c <chromSize_file>] [-e <enzyme_name>] [-C <cut_position>] [-o <output_dir>] 
-                    [-g <max_inter_align_gap>] [-O offset_restriction_site] [-d <drop>] [-D <intermediate_dir>] 
-                    [-s <stats_flag>] [-t <threads>] 
-
-    Dependency: cooler
+    Usage: $PROGNAME [-r <ref_fasta>] [-c <chromSize_file>] [-e <enzyme_name>] [-C <cut_position>] [-o <output_dir>] 
 
+    Dependency: cooler
     This script uses cooler digest to generate the restriction enzyme digested fragments BED file for iMARGI
-
     -r : Reference genome fasta file
     -c : Chromosome size file.
     -e : Enzyme name, we use AluI in iMARGI.
@@ -93,9 +86,9 @@ We created several script tools. Here we show the usage and source code of all t
 [*Source Code*](https://github.com/Zhong-Lab-UCSD/iMARGI-Docker/blob/master/src/imargi_parse.sh)
 
 ``` bash
-    Usage: imargi_parse.sh [-r <ref_name>] [-c <chromSize_file>] [-R <restrict_sites>] [-b <bam_file>] [-o <output_dir>] 
-                    [-Q <min_mapq>] [-G <max_inter_align_gap>] [-O <offset_restriction_site>] [-M <max_ligation_size>]
-                    [-d <drop>] [-D <intermediate_dir>] [-t <threads>] 
+   Usage: $PROGNAME [-r <ref_name>] [-c <chromSize_file>] [-R <restrict_sites>] [-b <bam_file>] [-o <output_dir>] 
+                     [-Q <min_mapq>] [-G <max_inter_align_gap>] [-O <offset_restriction_site>] [-M <max_ligation_size>]
+                     [-d <drop>] [-D <intermediate_dir>] [-t <threads>] 
 
     Dependency: pairtools pbgzip
 
@@ -123,10 +116,9 @@ We created several script tools. Here we show the usage and source code of all t
 [*Source Code*](https://github.com/Zhong-Lab-UCSD/iMARGI-Docker/blob/master/src/imargi_stats.sh)
 
 ``` bash
-    Usage: imargi_stats.sh [-D <distance_type>] [-d <distance_threshold>] [-i <input_file>] [-o <output_file>]
+    Usage: $PROGNAME [-D <distance_type>] [-d <distance_threshold>] [-i <input_file>] [-o <output_file>]
 
     Dependency: gzip, awk
-
     This script can be used to filter out short-range intra-chromosomal interactions with a threshold genomic distance.
 
     -D : Distance type. The default genomic position in .pairs file is the 5' end position, so the default distance
@@ -138,7 +130,7 @@ We created several script tools. Here we show the usage and source code of all t
          comma ',', such as '-d 1000,2000,10000,20000,100000,1000000', then the report will include the statistics
          number with different thresholds (space is not allowed).
     -i : Input .pairs.gz file.
-    -o : Output .pairs.gz file.
+    -o : Output stats text file.
     -h : Show usage help
 ```
 
@@ -147,11 +139,10 @@ We created several script tools. Here we show the usage and source code of all t
 [*Source Code*](https://github.com/Zhong-Lab-UCSD/iMARGI-Docker/blob/master/src/imargi_distfilter.sh)
 
 ``` bash
-    Usage: imargi_distfilter.sh [-D <distance_type>] [-d <distance_threshold>] [-F <deal_with_filter>]
-                [-i <input_file>] [-o <output_file>]
+    Usage: $PROGNAME [-D <distance_type>] [-d <distance_threshold>] [-F <deal_with_filter>] [-i <input_file>] 
+                     [-o <output_file>]
 
     Dependency: gzip, awk
-
     This script can be used to filter out short-range intra-chromosomal interactions with a threshold genomic distance.
 
     -D : Distance type. The default genomic position in .pairs file is the 5' end position, so the default distance
@@ -161,7 +152,7 @@ We created several script tools. Here we show the usage and source code of all t
     -d : The distance threshold for filtering. Default is 200000 (distance <200000 will be filtered out).
     -F : How to deal with the interactions that need to be filtered out? '-F' accepts 'drop' and 'output'. 'drop' means
          drop those interactions. 'output' means output a new file of all the filtered out interactions with prefix
-         'filterOut_'. Default is 'drop'.
+         'filterOut_'. Default is 'output'.
     -i : Input .pairs.gz file.
     -o : Output .pairs.gz file.
     -h : Show usage help
@@ -172,20 +163,19 @@ We created several script tools. Here we show the usage and source code of all t
 [*Source Code*](https://github.com/Zhong-Lab-UCSD/iMARGI-Docker/blob/master/src/imargi_convert.sh)
 
 ``` bash
-    Usage: imargi_convert.sh [-f <file_format>] [-k <keep_cols>] [-b <bin_size>] [-i <input_file>] [-o <output_file>]
+    Usage: $PROGNAME [-f <file_format>] [-k <keep_cols>] [-b <bin_size>] [-i <input_file>] [-o <output_file>] 
 
     Dependency: gzip, awk, cool
-
     This script can convert .pairs format to BEDPE, .cool, and GIVE interaction format.
-
     -f : The target format, only accept 'cool', 'bedpe' and 'give'. For 'cool', it will generate
          a ".cool" file with defined resolution of -b option and a multi-resolution ".mcool" file
-         based on the ".cool" file.
+         based on the ".cool" file. For 'bedpe', the output will be a pbgzip compressed file, so
+         remember to name the output_file '-o' with a '.gz' extension.
     -k : Keep extra information column in BEDPE. Columns ids in .pairs file you want to keep.
          For example, 'cigar1,cigar2'. Default value is "", i.e., drop all extra cols.
     -b : bin size for cool format. Default is 5000.
     -i : Input file.
-    -o : Output file.
+    -o : Output file. BEDPE output is a gzip compressed file. cool output consists of .cool and .mcool files.
     -h : Show usage help
 ```
 
@@ -200,7 +190,7 @@ We created several script tools. Here we show the usage and source code of all t
                 [-m <min_overlap>] [-G <cigar>]
                 [-t <threads>] [-i <input_file>] [-o <output_file>]
 
-    Dependency: gzip, awk, cool, BEDOPS
+    Dependency: gzip, pairtools, lz4, pbgzip
 
     This script can annotate both RNA and DNA ends with gene annotations in GTF/GFF format or any other genomic
     features in a simple BED file (each line is a named genomic feature). Multiple overlapped annotation features are

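The distance rule described in the `imargi_distfilter.sh` help text (intra-chromosomal pairs with distance below `-d` are filtered out) can be illustrated with a short awk pipeline. This is a sketch only, not the source of `imargi_distfilter.sh`: the `dist_filter` name is hypothetical, and it assumes the standard 4DN `.pairs` column order (readID, chrom1, pos1, chrom2, pos2, ...) with 5' end positions.

```bash
#!/usr/bin/env bash
# Sketch of distance filtering on a .pairs.gz file.
# Not the imargi_distfilter.sh source; dist_filter is a hypothetical name.
# Assumes 4DN .pairs columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2 ...

dist_filter() {
    local threshold="$1" input="$2"
    gzip -dc "$input" | awk -v d="$threshold" '
        BEGIN { FS = OFS = "\t" }
        /^#/     { print; next }      # keep header lines
        $2 != $4 { print; next }      # keep inter-chromosomal pairs
        {
            dist = $3 - $5
            if (dist < 0) dist = -dist
            if (dist >= d) print      # drop intra-chromosomal pairs closer than d
        }'
}
```

For example, `dist_filter 200000 in.pairs.gz | gzip > out.pairs.gz` would keep headers, all inter-chromosomal pairs, and intra-chromosomal pairs at least 200000 bases apart.
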
diff --git a/docs/build/html/_sources/further_analysis.md.txt b/docs/build/html/_sources/further_analysis.md.txt
@@ -18,7 +18,7 @@ statistics, such as number of intra- and inter-chromosomal interactions. We prov
 simple text data statistics report. The example command is:
 
 ``` bash
-docker run -v ~/imargi_example:/imargi imargi imargi_stats.sh \
+docker run -v ~/imargi_example:/imargi zhonglab/imargi imargi_stats.sh \
     -D 5end \
     -d 200000 \
     -i ./output/final_HEK_iMARGI.pairs.gz \
@@ -48,7 +48,7 @@ interactions with a distance threshold, which depends on the requirements of fur
 `imargi_distfilter.sh` tool for filtering interaction based on interaction genomic distance. The example command is:
 
 ``` bash
-docker run -v ~/imargi_example:/imargi imargi imargi_distfilter.sh \
+docker run -v ~/imargi_example:/imargi zhonglab/imargi imargi_distfilter.sh \
     -D 5end \
     -d 20000 \
     -i ./output/final_HEK_iMARGI.pairs.gz \
@@ -68,7 +68,7 @@ GTF/GFF format or any other genomic features in a simple BED file (each line is
 command below will generate two new gene annotation columns named as gene1 and gene2 in the output .pairs format file.
 
 ``` bash
-docker run -v ~/imargi_example:/imargi imargi imargi_annotate.sh \
+docker run -v ~/imargi_example:/imargi zhonglab/imargi imargi_annotate.sh \
     -A gtf \
     -a ./ref/gencode.v24.annotation.gtf \
     -l gene \
@@ -111,7 +111,7 @@ For further analysis and visualization, other formats instead of .pairs format m
 with different `-f` argument options. The example command is below:
 
 ``` bash
-docker run -v ~/imargi_example:/imargi imargi imargi_convert.sh \
+docker run -v ~/imargi_example:/imargi zhonglab/imargi imargi_convert.sh \
     -f bedpe \
     -i ./output/final_HEK_iMARGI.pairs.gz \
     -o ./output/final_HEK_iMARGI.bedpe.gz

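All four hunks above make the same change, replacing the local image name `imargi` with the published `zhonglab/imargi`. One way to keep the image name and mount in a single place is a small shell wrapper. This is a hypothetical convenience function, not part of iMARGI-Docker; the `imargi_run` name and the `IMARGI_WORKDIR` variable are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical convenience wrapper; not part of iMARGI-Docker.
# Runs a pipeline tool inside the zhonglab/imargi image with the example
# directory mounted at /imargi, as in the documented commands.

imargi_run() {
    # IMARGI_WORKDIR is an assumed override; the default matches the docs.
    local workdir="${IMARGI_WORKDIR:-$HOME/imargi_example}"
    docker run -v "$workdir:/imargi" zhonglab/imargi "$@"
}

# Example, equivalent to the imargi_convert.sh command in the docs:
# imargi_run imargi_convert.sh \
#     -f bedpe \
#     -i ./output/final_HEK_iMARGI.pairs.gz \
#     -o ./output/final_HEK_iMARGI.bedpe.gz
```
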
diff --git a/docs/build/html/_sources/index.md.txt b/docs/build/html/_sources/index.md.txt
@@ -78,7 +78,7 @@ The [Technical Notes](./technical_note.md) section shows more technical informat
 
 <small>[[1]](#a1) <span id="f1"></span> Sridhar, B. et al. Systematic Mapping of RNA-Chromatin Interactions In Vivo. Current Biology 27, 602–609 (2017).</small>
 
-<small>[[2]](#a2) <span id="f2"></span> Yan, Z. et al. Genome-wide co-localization of RNA-DNA interactions and fusion RNA pairs. bioRxiv 472019 (2018). doi:10.1101/472019</small>
+<small>[[2]](#a2) <span id="f2"></span> Yan, Z. et al. Genome-wide co-localization of RNA-DNA interactions and fusion RNA pairs. PNAS February 19, 2019, 116 (8) 3328-3337. https://doi.org/10.1073/pnas.1819788116 </small>
 
 <small>[[3]](#a3) <span id="f3"></span> Wu, W., Yan, Z., Wen X. & Zhong, S. iMARGI: Mapping RNA-DNA interactome by sequencing.</small>