Home
Welcome to the PureCLIP wiki!
Let's use the PUM2 eCLIP data from ENCODE (Van Nostrand et al., 2016), preprocessed as described in the previous step:
~/pureclip -i pum2.aligned.pooled.R2.bam -bai pum2.aligned.pooled.R2.bam.bai -g ref.fasta -o PureCLIP.crosslink_sites.bed -nt 10 -iv '1;2;3;'
With -iv you can specify the chromosomes (or transcripts) that are used to learn the parameters of PureCLIP's HMM. This reduces memory consumption and runtime. Usually, learning on a small subset of the chromosomes, e.g. Chr1-3, does not noticeably impair the results. However, for very sparse data this subset can be adjusted.
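As a minimal sketch, the call above could be assembled from a script like this. The helper names `build_iv_argument` and `build_pureclip_command` are hypothetical, and the sketch assumes a `pureclip` binary is on the `PATH`; only the flags already shown in the example command are used.

```python
def build_iv_argument(chromosomes):
    """Join chromosome names into the ';'-terminated list used with -iv above."""
    return "".join(f"{name};" for name in chromosomes)


def build_pureclip_command(bam, bai, genome, out_bed, chromosomes, threads=10):
    """Assemble the argument list for the basic PureCLIP call shown above.

    Hypothetical helper: it only mirrors the flags from the example command.
    """
    return [
        "pureclip",  # assumption: the binary is reachable under this name
        "-i", bam,
        "-bai", bai,
        "-g", genome,
        "-o", out_bed,
        "-nt", str(threads),
        "-iv", build_iv_argument(chromosomes),
    ]


command = build_pureclip_command(
    "pum2.aligned.pooled.R2.bam",
    "pum2.aligned.pooled.R2.bam.bai",
    "ref.fasta",
    "PureCLIP.crosslink_sites.bed",
    ["1", "2", "3"],
)
print(" ".join(command))
```

Passing this list to `subprocess.run(command, check=True)` would start the run without a shell, so the `;` characters in the `-iv` value need no extra quoting.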
~/pureclip -i {input.bam} -bai {input.bai} -g {input.ref} -o {output.states} -nt 10 -iv '1;2;3;' -g1g2k -ibam '/project/lincRNA_seq/eCLIP/PUM2_new/input_reads/STAR/Aligned.f.duplRm.so.R2.bam' -ibai '/project/lincRNA_seq/eCLIP/PUM2_new/input_reads/STAR/Aligned.f.duplRm.so.R2.bam.bai'
In order to incorporate CL-motifs into the model of PureCLIP, we first need to:
- Detect crosslink sites within an input control experiment.
- Learn CL-motifs on these sites using DREME [cite].
- Use FIMO to compute motif occurrences, each associated with a score, within your reference.

You can skip the first two points and use the provided precompiled list of common CL-motifs. However, it might be more accurate to use CL-motifs specific to the eCLIP experiment at hand.
~/pureclip -i {input.bam} -bai {input.bai} -g {input.ref} -o {output.states} -nt 10 -iv '1;2;3;' -nim 4 -fis '/project/lincRNA_seq/eCLIP/PUM2_new/FIMO_CL_MOTIFS/hmm_1_HMM_SET0.top5000_matches_thresh0.01/fimo.final.w10.m4.bed'
The main output of PureCLIP is a BED6 file, containing individual crosslink sites together with a score:
chromosome, start, start+1, state=3, score, strand
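A minimal sketch of reading this BED6 layout, assuming tab-separated columns in the order listed above. The `CrosslinkSite` type, the `parse_crosslink_sites` helper, and the example lines are made up for illustration.

```python
from typing import Iterable, List, NamedTuple


class CrosslinkSite(NamedTuple):
    chrom: str
    start: int
    end: int
    state: str
    score: float
    strand: str


def parse_crosslink_sites(lines: Iterable[str]) -> List[CrosslinkSite]:
    """Parse BED6 lines: chromosome, start, start+1, state, score, strand."""
    sites = []
    for line in lines:
        line = line.strip()
        # Skip empty lines and typical BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, state, score, strand = line.split("\t")[:6]
        sites.append(
            CrosslinkSite(chrom, int(start), int(end), state, float(score), strand)
        )
    return sites


# Made-up example lines, not real PureCLIP output.
example_lines = [
    "1\t100\t101\t3\t12.5\t+",
    "1\t105\t106\t3\t3.2\t+",
    "2\t500\t501\t3\t7.9\t-",
]
sites = parse_crosslink_sites(example_lines)

# Rank sites by score, highest first.
ranked = sorted(sites, key=lambda site: site.score, reverse=True)
for site in ranked:
    print(site.chrom, site.start, site.strand, site.score)
```

For a real result file, the same function can be applied to an open file handle, e.g. `parse_crosslink_sites(open("PureCLIP.crosslink_sites.bed"))`.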
Optionally, if an output file for binding regions is specified with -or, individual crosslink sites with a distance <= d (specified with -dm) are merged and written to a separate BED7 file:
chromosome, start, end, ., score, strand, score1;score2;score3;
where the 5th column is the sum of the crosslink site scores and the 7th column contains the individual crosslink site scores.
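The merging step can be illustrated with a short sketch. This is not PureCLIP's implementation: it assumes that the distance is measured between the start positions of consecutive sites, that only sites on the same chromosome and strand are merged, and that the region end is the last site's start + 1. The function names and the example sites are hypothetical.

```python
def to_region(group):
    """Turn a list of (chrom, start, score, strand) sites into one BED7-like row."""
    chrom, strand = group[0][0], group[0][3]
    start = group[0][1]
    end = group[-1][1] + 1  # assumption: region ends at the last site's start + 1
    scores = [site[2] for site in group]
    score_list = "".join(f"{score};" for score in scores)
    return (chrom, start, end, ".", sum(scores), strand, score_list)


def merge_sites(sites, max_dist):
    """Merge consecutive sites on the same chromosome and strand
    whose start positions are at most max_dist apart."""
    regions = []
    current = []
    for site in sorted(sites, key=lambda s: (s[0], s[3], s[1])):
        same_group = (
            current
            and site[0] == current[-1][0]
            and site[3] == current[-1][3]
            and site[1] - current[-1][1] <= max_dist
        )
        if same_group:
            current.append(site)
        else:
            if current:
                regions.append(to_region(current))
            current = [site]
    if current:
        regions.append(to_region(current))
    return regions


# Made-up sites: (chromosome, start, score, strand)
example_sites = [
    ("1", 100, 2.0, "+"),
    ("1", 105, 3.0, "+"),
    ("1", 130, 1.5, "+"),
    ("1", 103, 4.0, "-"),
]
regions = merge_sites(example_sites, max_dist=8)
for region in regions:
    print("\t".join(str(field) for field in region))
```

In this example the two `+` strand sites at 100 and 105 end up in one region whose 5th column is the sum of their scores, while the site at 130 and the `-` strand site each form a region of their own.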