skrakau edited this page May 25, 2017 · 41 revisions

Welcome to the PureCLIP wiki!

Let's use the PUM2 eCLIP data from ENCODE (Van Nostrand et al., 2016), preprocessed as described in the previous step:

PureCLIP

~/pureclip -i pum2.aligned.pooled.R2.bam -bai pum2.aligned.pooled.R2.bam.bai -g ref.fasta -o PureCLIP.crosslink_sites.bed -nt 10 -iv '1;2;3;'

With -iv you can specify the chromosomes (or transcripts) that are used to learn the parameters of PureCLIP's HMM. This reduces memory consumption and runtime. Usually, learning on a small subset of the chromosomes, e.g. Chr1-3, does not noticeably impair the results. However, for very sparse data the set of chromosomes used for learning may need to be adjusted.
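If the call is driven from a script, for example to try different chromosome subsets for learning, the argument list could be assembled roughly as follows. This is only a sketch: the helper names `interval_arg` and `pureclip_command` are made up for illustration, only the options from the command above are used, and the `name;name;` format is inferred from the `'1;2;3;'` example rather than taken from PureCLIP's documentation.

```python
def interval_arg(chromosomes):
    """Join chromosome names as 'name1;name2;...;' (format assumed from the example above)."""
    return "".join(f"{name};" for name in chromosomes)


def pureclip_command(bam, bai, genome, out_bed, threads, chromosomes,
                     binary="pureclip"):
    """Build the argument list for a basic PureCLIP run (hypothetical helper).

    Only the options shown in the command above are used; nothing is executed here.
    """
    return [
        binary,
        "-i", bam,
        "-bai", bai,
        "-g", genome,
        "-o", out_bed,
        "-nt", str(threads),
        "-iv", interval_arg(chromosomes),
    ]


cmd = pureclip_command(
    "pum2.aligned.pooled.R2.bam",
    "pum2.aligned.pooled.R2.bam.bai",
    "ref.fasta",
    "PureCLIP.crosslink_sites.bed",
    threads=10,
    chromosomes=["1", "2", "3"],
)
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`. Passing a list instead of a shell string avoids having to quote the `;` characters.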

PureCLIP incorporating input control experiments

~/pureclip -i {input.bam} -bai {input.bai} -g {input.ref} -o {output.states} -nt 10 -iv '1;2;3;' -g1g2k -ibam '/project/lincRNA_seq/eCLIP/PUM2_new/input_reads/STAR/Aligned.f.duplRm.so.R2.bam' -ibai '/project/lincRNA_seq/eCLIP/PUM2_new/input_reads/STAR/Aligned.f.duplRm.so.R2.bam.bai'

PureCLIP incorporating CL-motif scores

To incorporate CL-motifs into the PureCLIP model, we first need to carry out the three steps listed below. You can skip the first two steps and use the provided precompiled list of common CL-motifs instead. However, it might be more accurate to use CL-motifs specific to the eCLIP experiment at hand.

  1. Detect crosslink sites within an input control experiment:

  2. Learn CL-motifs on these sites using DREME [cite]

  3. Use FIMO to compute motif occurrences associated with a score within your reference.

    ~/pureclip -i {input.bam} -bai {input.bai} -g {input.ref} -o {output.states} -nt 10 -iv '1;2;3;' -nim 4 -fis '/project/lincRNA_seq/eCLIP/PUM2_new/FIMO_CL_MOTIFS/hmm_1_HMM_SET0.top5000_matches_thresh0.01/fimo.final.w10.m4.bed'

PureCLIP's output

The main output of PureCLIP is a BED6 file, containing individual crosslink sites together with a score:

chromosome, start, start+1, state=3, score, strand
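For downstream analysis, the six columns above can be read with a few lines of Python. This is a minimal sketch, assuming a tab-separated file without further columns of interest; the `CrosslinkSite` type, the `parse_sites` function and the example line are hypothetical and not part of PureCLIP.

```python
from typing import Iterable, Iterator, NamedTuple


class CrosslinkSite(NamedTuple):
    chrom: str
    start: int   # 0-based start, as in BED
    end: int     # start + 1 for a single crosslink site
    state: str   # 4th column (the HMM state, e.g. "3")
    score: float
    strand: str


def parse_sites(lines: Iterable[str]) -> Iterator[CrosslinkSite]:
    """Yield one CrosslinkSite per BED6 line, skipping blank and header lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, state, score, strand = line.split("\t")[:6]
        yield CrosslinkSite(chrom, int(start), int(end), state, float(score), strand)


# Made-up example line, for illustration only.
example = ["1\t100\t101\t3\t12.5\t+\n"]
sites = list(parse_sites(example))
```

With a real file, `parse_sites(open("PureCLIP.crosslink_sites.bed"))` would iterate over the sites line by line.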

Optionally, if an output file for binding regions is specified with --or, individual crosslink sites that lie within a distance <= d of each other (d is specified with --dm) are merged and written to a separate BED7 file:

chromosome, start, end, ., score, strand, score1;score2;score3;

where the 5th column is the sum of the crosslink site scores, while the 7th column contains the individual crosslink site scores.
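To make the merging step concrete, here is a small sketch of how such regions could be derived from a list of crosslink sites. It is not PureCLIP's implementation: it assumes that the distance is measured between the start positions of neighbouring sites on the same chromosome and strand, and the function names `merge_sites` and `_region` are made up for illustration. The exact number formatting in the 7th column may also differ from PureCLIP's.

```python
from itertools import groupby


def _region(chrom, strand, starts, scores):
    """Build one BED7-like tuple from the sites collected for a region."""
    score_list = "".join(f"{s};" for s in scores)  # e.g. "2.0;1.5;"
    return (chrom, starts[0], starts[-1] + 1, ".", sum(scores), strand, score_list)


def merge_sites(sites, max_dist):
    """Merge sites given as (chrom, start, score, strand) tuples.

    Neighbouring sites on the same chromosome and strand are merged when their
    start positions are at most max_dist apart (an assumption, see above).
    """
    ordered = sorted(sites, key=lambda s: (s[0], s[3], s[1]))
    regions = []
    for (chrom, strand), group in groupby(ordered, key=lambda s: (s[0], s[3])):
        starts, scores = [], []
        for _, start, score, _ in group:
            if starts and start - starts[-1] > max_dist:
                # Gap too large: close the current region and start a new one.
                regions.append(_region(chrom, strand, starts, scores))
                starts, scores = [], []
            starts.append(start)
            scores.append(score)
        regions.append(_region(chrom, strand, starts, scores))
    return regions


# Made-up sites, for illustration only.
sites = [("1", 100, 2.0, "+"), ("1", 103, 1.5, "+"),
         ("1", 120, 4.0, "+"), ("1", 101, 3.0, "-")]
regions = merge_sites(sites, max_dist=8)
```

In this example the first two sites on the plus strand end up in one region, while the site at position 120 and the site on the minus strand each form a region of their own.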
