Home
Welcome to the PureCLIP wiki!
PureCLIP is a tool to detect protein-RNA interaction footprints from single-nucleotide CLIP-seq data, such as iCLIP and eCLIP.
Before you start, please check
We use the PUM2 eCLIP data from ENCODE (Van Nostrand et al., 2016), preprocessed as described in the previous step. Alternatively, for testing you can download the preprocessed ENCODE data; in that case you still need to extract the R2 reads.
PureCLIP starts with mapped reads. More precisely, it expects only the reads that carry information about potential truncation events: R1 for iCLIP data and R2 for eCLIP data.
~/pureclip -i pum2.aligned.pooled.R2.bam -bai pum2.aligned.pooled.R2.bam.bai -g ref.fasta -o PureCLIP.crosslink_sites.bed -nt 10 -iv '1;2;3;'
With -iv you can specify the chromosomes (or transcripts) that are used to learn the parameters of PureCLIP's HMM. This reduces memory consumption and runtime. Usually, learning on a small subset of the chromosomes, e.g. Chr1-3, does not noticeably impair the results. However, for very sparse data the subset can be enlarged.
In order to run PureCLIP with input control experiments, additionally hand over the (preprocessed) BAM file from the input experiment with -ibam and the associated BAI file with -ibai:
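The value passed to -iv in the command above is a semicolon-terminated list of sequence names. As a small convenience, that string can be assembled from a list of names. The function `make_iv` below is a hypothetical helper, not part of PureCLIP, and it assumes the names you pass match those used in your BAM header and reference FASTA (for example `1` versus `chr1`).

```sh
# Hypothetical helper (not part of PureCLIP): join sequence names into the
# semicolon-terminated form used in the example command, e.g. '1;2;3;'.
make_iv() {
    printf '%s;' "$@"
}

iv=$(make_iv 1 2 3)
echo "$iv"   # prints 1;2;3;
```

The result can then be passed on the command line as `-iv "$iv"`.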
~/pureclip -i pum2.aligned.pooled.R2.bam -bai pum2.aligned.pooled.R2.bam.bai -g hg19_ref.fasta -o PureCLIP.crosslink_sites.cov_inputSignal.bed -nt 10 -iv '1;2;3;' -g1g2k -ibam input_pum2.aligned.pooled.R2.bam -ibai input_pum2.aligned.pooled.R2.bam.bai
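Because this call takes two BAM files and two index files, a missing index is easy to overlook. A minimal pre-flight check could look like the sketch below. `require_bai` is a hypothetical helper, not part of PureCLIP, and it assumes each index is named `<file>.bai`, as in the commands above.

```sh
# Hypothetical pre-flight check (not part of PureCLIP): confirm that every
# BAM file given as an argument has an index named <file>.bai.
require_bai() {
    for bam in "$@"; do
        if [ ! -f "$bam.bai" ]; then
            echo "missing index: $bam.bai" >&2
            return 1
        fi
    done
}

# Example: stop early if either index is missing.
# require_bai pum2.aligned.pooled.R2.bam input_pum2.aligned.pooled.R2.bam || exit 1
```

If an index is missing, `samtools index file.bam` creates `file.bam.bai`.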
In order to incorporate CL-motifs into the model of PureCLIP, we first need to compute position-wise CL-motif scores, which indicate each position's CL-affinity. You can use the provided precompiled list of common_CL-motifs. If you want to compute the CL-motifs specific to the used eCLIP experiment, please have a look here. Assuming we are given a set of CL-motifs, we use FIMO (Grant et al., 2011) to compute motif occurrences, each associated with a score, within your reference. The following script runs FIMO to compute position-wise CL-motif scores, using only regions that are covered by CLIP-seq reads:
~/compute_CLmotif_scores.sh ref.fasta alignments.bam motifs.xml motifs.txt temp_dir/ fimo_clmotif_occurences.bed
The computed scores are then handed over to PureCLIP, together with the parameter -nim 4, indicating that scores with associated motif IDs 1-4 will be used (Default: only scores with motif ID 1 will be used).
~/pureclip -i pum2.aligned.pooled.R2.bam -bai pum2.aligned.pooled.R2.bam.bai -g hg19_ref.fasta -o PureCLIP.crosslink_sites.cov_CLmotifs.bed -nt 10 -iv '1;2;3;' -nim 4 -fis fimo_clmotif_occurences.bed
The main output of PureCLIP is a BED6 file, containing individual crosslink sites together with a score:
chromosome, start, start+1, state=3, score, strand
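Since each crosslink site is one tab-separated line with the score in the 5th column, the file can be post-processed with standard tools. The sketch below keeps only sites at or above a score cutoff. `filter_sites` is a hypothetical helper, the example records are made up, and the cutoff of 5 is arbitrary rather than a recommended value.

```sh
# Hypothetical helper: keep crosslink sites whose score (column 5) is at
# least the given cutoff. Reads BED6 on stdin, writes BED6 on stdout.
filter_sites() {
    awk -F '\t' -v min="$1" '$5 + 0 >= min + 0'
}

# Made-up example records in the layout described above; only the first
# record passes the cutoff.
printf '1\t100\t101\t3\t12.5\t+\n1\t250\t251\t3\t0.8\t-\n' | filter_sites 5
```

On a real output file this could be used as `filter_sites 5 < PureCLIP.crosslink_sites.bed > filtered.bed`, where `filtered.bed` is a name chosen for illustration.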
Optionally, if an output file for binding regions is specified with -or, individual crosslink sites with a distance <= d (specified with -dm) are merged and written to a separate BED7 file:
chromosome, start, end, ., score, strand, score1;score2;score3;
where the 5th column is the sum of the crosslink site scores, while the 7th column contains the individual crosslink site scores.
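To make the relation between the two files concrete, the sketch below merges BED6 crosslink sites into lines with the seven columns described above. It is an illustration only and not PureCLIP's implementation: `merge_sites` is a hypothetical helper, it assumes distance is measured between the start positions of neighbouring sites on the same chromosome and strand, and the example records and the distance of 8 are made up.

```sh
# Illustrative sketch only (not PureCLIP's implementation): merge sites on the
# same chromosome and strand whose start positions are at most d apart.
# Expects BED6 on stdin, sorted by chromosome, strand, then start.
merge_sites() {
    awk -F '\t' -v OFS='\t' -v d="$1" '
        function emit() {
            if (n > 0) print chrom, start, end, ".", sum, strand, scores
        }
        {
            if (n > 0 && $1 == chrom && $6 == strand && $2 - last <= d) {
                # Extend the current region and accumulate its scores.
                end = $3; sum += $5; scores = scores $5 ";"; n++
            } else {
                # Write out the previous region and start a new one.
                emit()
                chrom = $1; start = $2; end = $3; strand = $6
                sum = $5 + 0; scores = $5 ";"; n = 1
            }
            last = $2
        }
        END { emit() }
    '
}

# Made-up sites: the first two are 4 nt apart and are merged, the third is not.
printf '1\t100\t101\t3\t2.5\t+\n1\t104\t105\t3\t1.5\t+\n1\t300\t301\t3\t3.0\t+\n' | merge_sites 8
```

For these records the first output line spans 100 to 105 with a summed score of 4 and `2.5;1.5;` in the 7th column. Unsorted input can be prepared with `sort -k1,1 -k6,6 -k2,2n` before piping it into the function.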