release v1.0 freeze
frankyan committed Feb 27, 2019
1 parent 306cfdd commit a5807e7
Showing 56 changed files with 1,058 additions and 937 deletions.
61 changes: 48 additions & 13 deletions README.md
Expand Up @@ -15,6 +15,9 @@ iMARGI-Docker distributes the iMARGI sequencing data processing pipeline
- [Build with Dockerfile](#build-with-dockerfile)
- [Software Testing Demo](#software-testing-demo)
- [Testing Data](#testing-data)
- [iMARGI sequencing data (paired FASTQ)](#imargi-sequencing-data-paired-fastq)
- [Reference genome data (FASTA)](#reference-genome-data-fasta)
- [bwa index data](#bwa-index-data)
- [Testing Command](#testing-command)
- [Testing Results](#testing-results)
- [Running Time Profile](#running-time-profile)
Expand Down Expand Up @@ -106,19 +109,37 @@ To test whether you have successfully installed iMARGI-Docker, you can follow in

### Testing Data

#### iMARGI sequencing data (paired FASTQ)

Because real iMARGI sequencing data sets are very large, we randomly extracted a small chunk of real data for software
testing. The data can be found in the [`data`](./data/) folder. Please download them to your computer.

Besides, you need to download a human genome reference FASTA file.
- [R1 reads](https://github.com/Zhong-Lab-UCSD/iMARGI-Docker/raw/master/data/sample_R1.fastq.gz)
- [R2 reads](https://github.com/Zhong-Lab-UCSD/iMARGI-Docker/raw/master/data/sample_R2.fastq.gz)

#### Reference genome data (FASTA)

Besides, you need to download a human genome reference FASTA file.
We use the reference genome used by
[4D Nucleome](https://www.4dnucleome.org/) and
[ENCODE project](https://www.encodeproject.org/data-standards/reference-sequences/). The FASTA file of the reference
[ENCODE project](https://www.encodeproject.org/data-standards/reference-sequences/).

The FASTA file of the reference
genome is too large for us to host in the GitHub repo. You can download it using the link:
[GRCh38_no_alt_analysis_set_GCA_000001405.15](https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz).

- [GRCh38_no_alt_analysis_set_GCA_000001405.15](https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz)

It needs to be decompressed using the `gunzip -d` command on Linux/macOS. If your system is Windows, you can use 7-Zip
to decompress the `.gz` file. You can also use the `gunzip` tool delivered in iMARGI-Docker.

We assume that you put the data and reference files in the following directory structure.
#### bwa index data

Because the `bwa index` process takes a long time (more than 1 hour), we suggest downloading our pre-built index files for the reference
genome. Please download the following gzip-compressed `bwa_index` folder and decompress it (`tar zxvf`) on your machine.

- [bwa index files](https://sysbio.ucsd.edu/imargi_pipeline/bwa_index.tar.gz)
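
The two decompression steps described above (`gunzip` for the reference FASTA, `tar zxvf` for the bwa index archive) could be wrapped in a small shell helper. This is only a sketch, assuming `gunzip` and `tar` are available on the host; the function name `unpack_reference` is hypothetical and not part of iMARGI-Docker.

```bash
# Hypothetical helper (not part of iMARGI-Docker): unpack a downloaded file
# into a target directory, choosing the tool by file extension.
unpack_reference() {
    local archive="$1" dest="$2"
    mkdir -p "$dest"
    case "$archive" in
        *.tar.gz)
            # e.g. bwa_index.tar.gz, which contains the bwa_index folder
            tar zxvf "$archive" -C "$dest"
            ;;
        *.gz)
            # e.g. the reference FASTA; -c leaves the original .gz untouched
            gunzip -c "$archive" > "$dest/$(basename "${archive%.gz}")"
            ;;
        *)
            echo "unsupported file type: $archive" >&2
            return 1
            ;;
    esac
}
```

For example, `unpack_reference bwa_index.tar.gz ~/imargi_example/ref` would place the extracted `bwa_index` folder under `ref`, matching the layout assumed below.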

*We assume that you put the data and reference files in the following directory structure.*

``` bash
~/imargi_example
Expand All @@ -127,7 +148,13 @@ We assume that you put the data and reference files in the following directory s
│   └── sample_R2.fastq.gz
├── output
└── ref
    └── GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
    ├── GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
    └── bwa_index
        ├── bwa_index_hg38.amb
        ├── bwa_index_hg38.ann
        ├── bwa_index_hg38.bwt
        ├── bwa_index_hg38.pac
        └── bwa_index_hg38.sa
```
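
The directory structure above can be prepared and checked with a few lines of shell. The sketch below is an illustration only: the function names `make_imargi_layout` and `check_imargi_inputs` are hypothetical, and the file names are the ones used in this README's example.

```bash
# Hypothetical helper (not part of iMARGI-Docker): create the directory
# skeleton that the test command expects under a chosen root directory.
make_imargi_layout() {
    local root="$1"
    mkdir -p "$root/data" "$root/output" "$root/ref/bwa_index"
}

# Hypothetical helper: report which of the expected input files are missing.
# Returns 0 when all inputs are present, 1 otherwise.
check_imargi_inputs() {
    local root="$1" missing=0 f
    for f in \
        data/sample_R1.fastq.gz \
        data/sample_R2.fastq.gz \
        ref/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta; do
        if [ ! -f "$root/$f" ]; then
            echo "missing: $root/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A possible use is `make_imargi_layout ~/imargi_example` before downloading, then `check_imargi_inputs ~/imargi_example` before running the test command.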

### Testing Command
Expand All @@ -141,6 +168,7 @@ docker run -u 1043 -v ~/imargi_example:/imargi zhonglab/imargi \
-N test_sample \
-t 4 \
-g ./ref/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta \
-i ./ref/bwa_index/bwa_index_hg38 \
-1 ./data/sample_R1.fastq.gz \
-2 ./data/sample_R2.fastq.gz \
-o ./output
Expand All @@ -160,23 +188,30 @@ docker run -u 1043 -v ~/imargi_example:/imargi zhonglab/imargi \
- The command line is long, so `\` was used for splitting it into multiple lines in the example. It's a Linux or MacOS
style. However, in Windows, you need to replace `\` with `^`.

- Building bwa index costs the most running time. If you have human genome bwa index built before, you can supply it
with `-i` argument. See more details in the
- `-i`: Building the bwa index takes a long time, so we used the pre-built index files with the `-i` argument. Other
arguments can also be used to supply pre-generated files, such as `-R` for the restriction fragment BED file and
`-c` for the chromosome size file. See more details in the
[documentation of command line API section](https://sysbio.ucsd.edu/imargi_pipeline/commandline_api.html#imargi-wrapper-sh)
`

- `-i`: If you don't supply bwa index files, `imargi_wrapper.sh` will generate them automatically. This works
well on Linux. However, it doesn't work on Windows and macOS because `bwa index` uses `fsync` when building a
large genome index, which cannot handle different drive formats (`-v` mounts a Windows/macOS drive into the Linux container).
So it's better to build the index in advance. If you are familiar with Docker volumes, there is a solution to the
problem. Please read the
[technical note of iMARGI pipeline documentation](https://sysbio.ucsd.edu/imargi_pipeline/technical_note.html#solve-bwa-index-failure-problem) for
details.
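
Since the `-i` argument takes an index prefix rather than a single file, it can help to confirm that the index is complete before starting a run. The sketch below assumes the standard set of five files written by `bwa index` (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa`); the function name `have_bwa_index` is hypothetical and not part of iMARGI-Docker.

```bash
# Hypothetical helper: succeed only if all five files written by `bwa index`
# exist for the given prefix (e.g. ./ref/bwa_index/bwa_index_hg38).
have_bwa_index() {
    local prefix="$1" ext
    for ext in amb ann bwt pac sa; do
        [ -f "$prefix.$ext" ] || return 1
    done
    return 0
}
```

For example, `have_bwa_index ~/imargi_example/ref/bwa_index/bwa_index_hg38 && echo "index found, pass it with -i"` prints the message only when the index is complete.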

### Testing Results

#### Running Time Profile

It took about 85 minutes to run the pipeline. Most of the time (75 min) was consumed by building bwa index files.
So once you have built the bwa index, supply it to the command with `-i` next time.
It took about 15 minutes to run the pipeline (with the `-i` bwa index argument).

Step | Time | Speed up suggestion
---------|----------|----------
Generating chromosome size file | 10 sec | It's fast, but you can supply with `-c` once you've generated it.
Generating bwa index | 75 min | Supply with `-i` once you've built it.
Generating restriction fragment file | 4 min | Supply with `-R` once you've created it.
Generating chromosome size file | 10 sec | It's fast, but you can also supply it with `-c` once you've generated it.
Generating bwa index (skipped) | 75 min | Supply with `-i` if you've pre-built index files.
Generating restriction fragment file | 4 min | Supply with `-R` once you've created it.
cleaning | 10 sec | It's fast and not parallelized.
bwa mapping | 2 min | More CPU cores with `-t`.
interaction pair parsing | 1 min | More CPU cores with `-t`.
Expand Down
Binary file modified docs/build/doctrees/commandline_api.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/environment.pickle
Binary file not shown.
Binary file modified docs/build/doctrees/further_analysis.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/index.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/installation.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/origin_imargi_methods.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/performance profile.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/quick_example.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/step_by_step_illustration.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/technical_note.doctree
Binary file not shown.
2 changes: 1 addition & 1 deletion docs/build/html/.buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 893b56350ce42023ad6ecd0594af280a
config: 25a1ecdc13a051456e7e1298a01d2ebe
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file modified docs/build/html/_images/docker_command_example.png
70 changes: 30 additions & 40 deletions docs/build/html/_sources/commandline_api.md.txt
Expand Up @@ -28,8 +28,10 @@ We created several script tools. Here we show the usage and source code of all t
-O : Max offset bases for filtering pairs based on R2 5' end positions to restriction sites. Default 3.
-M : Max size of ligation fragment for sequencing. It's used for filtering unligated DNA sequence.
-t : Max CPU threads for parallelized processing, at least 4. (Default 8)
-1 : R1 fastq.gz file, if there are multiple files, just separated with space, such as -1 lane1_R1.fq lane2_R1.fq
-2 : R2 fastq.gz file, if there are multiple files, just separated with space, such as -1 lane1_R2.fq lane2_R2.fq
-1 : R1 fastq.gz file, if there are multiple files, just separated with space or use wildcard,
such as '-1 lane1_R1.fq.gz lane2_R1.fq.gz', or '-1 lane*_R1.fq.gz'.
-2 : R2 fastq.gz file, if there are multiple files, just separated with space or use wildcard,
such as '-2 lane1_R2.fq.gz lane2_R2.fq.gz', or '-2 lane*_R2.fq.gz'.
-o : Output directoy
-h : Show usage help
```
Expand All @@ -39,30 +41,25 @@ We created several script tools. Here we show the usage and source code of all t
[*Source Code*](https://github.com/Zhong-Lab-UCSD/iMARGI-Docker/blob/master/src/imargi_clean.sh)

``` bash
Usage: imargi_clean.sh [-1 <fastq.gz_R1>] [-2 <fastq.gz_R2>] [-o <output_dir>] [-f <filter_CT>] [-d <drop>]
[-t <threads>] [-b <block_size>]
Usage: $PROGNAME [-1 <fastq.gz_R1>] [-2 <fastq.gz_R2>] [-N <base_name>] [-o <output_dir>] [-t <threads>]

Dependencies: seqtk, gzip, zcat, awk, parallel

This script will clean the paired reads (R1 and R2) of iMARGI sequencing Fastq data. According to the iMARGI design,
RNA end reads (R1) start with 2 random based, and DNA end reads (R2) of successful ligation fragments start
with "CT". We need to remove the first 2 bases of R1 for better mapping. For R2, We can strictly clean the data by
filtering out those R2 reads not starting with "CT" in this step by setting "-f CT". Alternatively,
you can skip the "CT" filtering without "-f" parameter. You can apply the "CT" filtering in interaction pairs
filtering step.
If you choose to do "CT" filtering, the script also fixes the paired reads in R1. If "-d" was set as "true",
it will drop all the non "CT" started R2 reads and paired R1 reads, which outputs two fastq files with prefix
"clean_". If "-d" was "false", the filtered read pairs would also be outputed in a pair of fastq files with prefix
"drop_". "-d" only works when "-f" is set, and the default setting of "-d" is "false".
RNA end reads (R1) start with 2 random based. We need to remove the first 2 bases of R1 for better mapping.
If you provided multiple input files (different lanes) in '-1' and '-2' with ',' separator or contains wildcard,
then the output will merge multi-lanes fastq files to one clean fastq file.

The input fastq files must be gzip files, i.e., fastq.gz or fq.gz. The output files are also gzipped files fastq.gz.

-1 : R1 fastq.gz file, if there are multiple files, just separated with space, such as -1 lane1_R1.fq lane2_R1.fq
-2 : R2 fastq.gz file, if there are multiple files, just separated with space, such as -1 lane1_R2.fq lane2_R2.fq
-1 : R1 fastq.gz file, if there are multiple files, just separated with space or use wildcard,
such as '-1 lane1_R1.fq.gz lane2_R1.fq.gz', or '-1 lane*_R1.fq.gz'.
-2 : R2 fastq.gz file, if there are multiple files, just separated with space or use wildcard,
such as '-2 lane1_R2.fq.gz lane2_R2.fq.gz', or '-2 lane*_R2.fq.gz'.
-N : Base name for ouput result. Such as -N HEK_iMARGI, then output cleaned and merged fastq.gz file will be
renamed using the base name.
-o : Output directoy
-f : Filtering sequence by 5' start of R2. If not set, no filtering applied. "CT" filtering can be set as "-f CT"
-d : Flag of dropping, working with "-f". Default is false, i.e., output drop_*fastq.gz files of dropped read pairs.
-t : Max CPU threads for parallelized processing, at least 4. (Default 8)
-b : Fastq data block size (number of reads) for each thread. Default 2000000.
-h : Show usage help
```

Expand All @@ -71,14 +68,10 @@ We created several script tools. Here we show the usage and source code of all t
[*Source Code*](https://github.com/Zhong-Lab-UCSD/iMARGI-Docker/blob/master/src/imargi_rsfrags.sh)

``` bash
Usage: imargi_rsfrags.sh [-r <ref_fasta>] [-c <chromSize_file>] [-e <enzyme_name>] [-C <cut_position>] [-o <output_dir>]
[-g <max_inter_align_gap>] [-O offset_restriction_site] [-d <drop>] [-D <intermediate_dir>]
[-s <stats_flag>] [-t <threads>]

Dependency: cooler
Usage: $PROGNAME [-r <ref_fasta>] [-c <chromSize_file>] [-e <enzyme_name>] [-C <cut_position>] [-o <output_dir>]

Dependency: cooler
This script use cooler digest to generate the restriction Enzyme digested fragments bed file for iMARGI

-r : Reference genome fasta file
-c : Chromosome size file.
-e : Enzyme name, we use AluI in iMARGI.
Expand All @@ -93,9 +86,9 @@ We created several script tools. Here we show the usage and source code of all t
[*Source Code*](https://github.com/Zhong-Lab-UCSD/iMARGI-Docker/blob/master/src/imargi_parse.sh)

``` bash
Usage: imargi_parse.sh [-r <ref_name>] [-c <chromSize_file>] [-R <restrict_sites>] [-b <bam_file>] [-o <output_dir>]
[-Q <min_mapq>] [-G <max_inter_align_gap>] [-O <offset_restriction_site>] [-M <max_ligation_size>]
[-d <drop>] [-D <intermediate_dir>] [-t <threads>]
Usage: $PROGNAME [-r <ref_name>] [-c <chromSize_file>] [-R <restrict_sites>] [-b <bam_file>] [-o <output_dir>]
[-Q <min_mapq>] [-G <max_inter_align_gap>] [-O <offset_restriction_site>] [-M <max_ligation_size>]
[-d <drop>] [-D <intermediate_dir>] [-t <threads>]

Dependency: pairtools pbgzip

Expand Down Expand Up @@ -123,10 +116,9 @@ We created several script tools. Here we show the usage and source code of all t
[*Source Code*](https://github.com/Zhong-Lab-UCSD/iMARGI-Docker/blob/master/src/imargi_stats.sh)

``` bash
Usage: imargi_stats.sh [-D <distance_type>] [-d <distance_threshold>] [-i <input_file>] [-o <output_file>]
Usage: $PROGNAME [-D <distance_type>] [-d <distance_threshold>] [-i <input_file>] [-o <output_file>]

Dependency: gzip, awk

This script can be used to filter out short-range intra-chromosomal interactions with a threshold genomic distance.

-D : Distance type. The default genomic position in .pairs file is the 5' end position, so the default distance
Expand All @@ -138,7 +130,7 @@ We created several script tools. Here we show the usage and source code of all t
comma ',', such as '-d 1000,2000,10000,20000,100000,1000000', then the report will include the statistics
number with different thresholds (space is not allowed).
-i : Input .pairs.gz file.
-o : Output .pairs.gz file.
-o : Output stats text file.
-h : Show usage help
```

Expand All @@ -147,11 +139,10 @@ We created several script tools. Here we show the usage and source code of all t
[*Source Code*](https://github.com/Zhong-Lab-UCSD/iMARGI-Docker/blob/master/src/imargi_distfilter.sh)

``` bash
Usage: imargi_distfilter.sh [-D <distance_type>] [-d <distance_threshold>] [-F <deal_with_filter>]
[-i <input_file>] [-o <output_file>]
Usage: $PROGNAME [-D <distance_type>] [-d <distance_threshold>] [-F <deal_with_filter>] [-i <input_file>]
[-o <output_file>]

Dependency: gzip, awk

This script can be used to filter out short-range intra-chromosomal interactions with a threshold genomic distance.

-D : Distance type. The default genomic position in .pairs file is the 5' end position, so the default distance
Expand All @@ -161,7 +152,7 @@ We created several script tools. Here we show the usage and source code of all t
-d : The distance threshold for filtering. Default is 200000 (distance <200000 will be filtered out).
-F : How to deal with the interactions need to be filtered out? '-F' accepts 'drop' and 'output'. 'drop' means
drop those interactions. 'output' means output an new file of all the filtered out interactions with prefix
'filterOut_'. Default is 'drop'.
'filterOut_'. Default is 'output'.
-i : Input .pairs.gz file.
-o : Output .pairs.gz file.
-h : Show usage help
Expand All @@ -172,20 +163,19 @@ We created several script tools. Here we show the usage and source code of all t
[*Source Code*](https://github.com/Zhong-Lab-UCSD/iMARGI-Docker/blob/master/src/imargi_convert.sh)

``` bash
Usage: imargi_convert.sh [-f <file_format>] [-k <keep_cols>] [-b <bin_size>] [-i <input_file>] [-o <output_file>]
Usage: $PROGNAME [-f <file_format>] [-k <keep_cols>] [-b <bin_size>] [-i <input_file>] [-o <output_file>]

Dependency: gzip, awk, cool

This script can convert .pairs format to BEDPE, .cool, and GIVE interaction format.

-f : The target format, only accept 'cool', 'bedpe' and 'give'. For 'cool', it will generate
a ".cool" file with defined resolution of -b option and a multi-resolution ".mcool" file
based on the ".cool" file.
based on the ".cool" file. For 'bedpe', the output will be pbgzip compressed file. So
keep in mind to name the output_file '-o' with '.gz' extesion.
-k : Keep extra information column in BEDPE. Columns ids in .pairs file you want to keep.
For example, 'cigar1,cigar2'. Default value is "", i.e., drop all extra cols.
-b : bin size for cool format. Default is 5000.
-i : Input file.
-o : Output file.
-o : Output file. BEDPE output is gzip compressed file. cool output are .cool and .mcool files.
-h : Show usage help
```

Expand All @@ -200,7 +190,7 @@ We created several script tools. Here we show the usage and source code of all t
[-m <min_overlap>] [-G <cigar>]
[-t <threads>] [-i <input_file>] [-o <output_file>]

Dependency: gzip, awk, cool, BEDOPS
Dependency: gzip, pairtools, lz4 pbgzip

This script can annotate both RNA and DNA ends with gene annotations in GTF/GFF format or any other genomic
features in a simple BED file (each line is a named genomic feature). Multiple overlapped annotation features are
Expand Down
8 changes: 4 additions & 4 deletions docs/build/html/_sources/further_analysis.md.txt
Expand Up @@ -18,7 +18,7 @@ statistics, such as number of intra- and inter-chromosomal interactions. We prov
simple text data statistics report. The example command is:

``` bash
docker run -v ~/imargi_example:/imargi imargi imargi_stats.sh \
docker run -v ~/imargi_example:/imargi zhonglab/imargi imargi_stats.sh \
-D 5end \
-d 200000 \
-i ./output/final_HEK_iMARGI.pairs.gz \
Expand Down Expand Up @@ -48,7 +48,7 @@ interactions with a distance threshold, which depends on the requirements of fur
`imargi_distfilter.sh` tool for filtering interaction based on interaction genomic distance. The example command is:

``` bash
docker run -v ~/imargi_example:/imargi imargi imargi_distfilter.sh \
docker run -v ~/imargi_example:/imargi zhonglab/imargi imargi_distfilter.sh \
-D 5end \
-d 20000 \
-i ./output/final_HEK_iMARGI.pairs.gz \
Expand All @@ -68,7 +68,7 @@ GTF/GFF format or any other genomic features in a simple BED file (each line is
command below will generate two new gene annotation columns named as gene1 and gene2 in the output .pairs format file.

``` bash
docker run -v ~/imargi_example:/imargi imargi imargi_annotate.sh \
docker run -v ~/imargi_example:/imargi zhonglab/imargi imargi_annotate.sh \
-A gtf \
-a ./ref/gencode.v24.annotation.gtf \
-l gene \
Expand Down Expand Up @@ -111,7 +111,7 @@ For further analysis and visualization, other formats instead of .pairs format m
with different `-f` argument options. The example command is below:

``` bash
docker run -v ~/imargi_example:/imargi imargi imargi_convert.sh \
docker run -v ~/imargi_example:/imargi zhonglab/imargi imargi_convert.sh \
-f bedpe \
-i ./output/final_HEK_iMARGI.pairs.gz \
-o ./output/final_HEK_iMARGI.bedpe.gz
Expand Down
2 changes: 1 addition & 1 deletion docs/build/html/_sources/index.md.txt
Expand Up @@ -78,7 +78,7 @@ The [Technical Notes](./technical_note.md) section shows more technical informat

<small>[[1]](#a1) <span id="f1"></span> Sridhar, B. et al. Systematic Mapping of RNA-Chromatin Interactions In Vivo. Current Biology 27, 602–609 (2017).</small>

<small>[[2]](#a2) <span id="f2"></span> Yan, Z. et al. Genome-wide co-localization of RNA-DNA interactions and fusion RNA pairs. bioRxiv 472019 (2018). doi:10.1101/472019</small>
<small>[[2]](#a2) <span id="f2"></span> Yan, Z. et al. Genome-wide co-localization of RNA-DNA interactions and fusion RNA pairs. PNAS February 19, 2019, 116 (8) 3328-3337. https://doi.org/10.1073/pnas.1819788116 </small>

<small>[[3]](#a3) <span id="f3"></span> Wu, W., Yan, Z., Wen X. & Zhong, S. iMARGI: Mapping RNA-DNA interactome by sequencing.</small>

Expand Down