#########################################################################
diffReps: Detecting differential chromatin modification sites
from ChIP-seq data with biological replicates
by Li Shen, Ph.D.
Assistant Professor
Department of Neuroscience
Mount Sinai School of Medicine
New York, New York, U.S.A.
Created: Jun. 2011
Last modified: Mar 29, 2012
For most updated information, visit: https://code.google.com/p/diffreps/
#########################################################################
INTRODUCTION
ChIP-seq is now widely used to profile the enrichment of a DNA-binding protein
on a genome. It is of high interest to compare the binding differences of a
histone mark or transcription factor between two contrasting conditions, such as
disease vs. control. diffReps was developed to serve this purpose. It scans the
whole genome using a sliding window, performs millions of statistical tests,
and reports the significant hits. diffReps takes into account the biological
variation within a group of samples and uses that information to enhance
statistical power. Considering biological variation is of high importance,
especially for in vivo brain tissues (which is my group's high priority).
PREREQUISITES
diffReps requires the following two CPAN modules:
Statistics::TTest
Math::CDF
They can be downloaded and installed from CPAN. If you use cpanminus to install
diffReps, they will be installed automatically.
Some systems have reported missing package Time::HiRes. If that is the case, it
can be installed from CPAN or your choice of package manager. To test if it is
already installed, use "perldoc Time::HiRes".
If you want to use the annotation tool, region_analysis.pl, it will also call
two external programs, refgene_getnearestgene and intersectBed. They were developed
by other researchers (see CREDITS). The former is included in the bin/ directory
and should already be installed. intersectBed is part of the BEDTools package,
which can be downloaded from: http://code.google.com/p/bedtools/.
INSTALLATION
Installing diffReps is just like installing a standard Perl module. Extract the
downloaded package, change to the program directory, and type the following
commands:
perl Makefile.PL (Optional, PREFIX=your_perl_directory)
make
make test
make install
If you have root privileges, diffReps.pl will most likely be installed in
/usr/bin/. If you specified PREFIX when running Makefile.PL, it will be installed
in your_perl_directory/bin. Add your_perl_directory/bin to your PATH environment
variable, or copy diffReps.pl from your_perl_directory/bin to a directory that
is already in PATH, such as /home/yourname/bin.
Alternatively, if you have cpanminus installed, you can install diffReps with a
single command:
cpanm diffReps-XXX.tar.gz
It will try to satisfy all the dependencies for you.
CHROMOSOME LENGTHS
It is important to supply diffReps with chromosome length information, which
it needs to divide the chromosomes into smaller bins. diffReps has a few genomes
built in, so all you need to do is give a genome name, such as mm9 or hg19. If
the genome you are interested in is not already defined, you can supply a text
file of chromosome lengths. An example input looks like:
chr1 197195431
chr2 181748086
chr3 159599782
...
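Before passing a custom chromosome length file to diffReps, it can help to
sanity-check it. The following is a minimal sketch, not part of diffReps: it
assumes the file has exactly two whitespace-separated columns per line (name and
length), as in the example above, and the function name parse_chrom_lengths is
hypothetical.

```python
def parse_chrom_lengths(text):
    """Parse '<chrom> <length>' lines into a dict, rejecting malformed lines.

    Assumption: two whitespace-separated columns, as in the README example.
    """
    lengths = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        fields = line.split()
        if len(fields) != 2 or not fields[1].isdigit() or int(fields[1]) <= 0:
            raise ValueError(
                f"line {lineno}: expected '<chrom> <length>', got {line!r}")
        if fields[0] in lengths:
            raise ValueError(f"line {lineno}: duplicate chromosome {fields[0]!r}")
        lengths[fields[0]] = int(fields[1])
    return lengths


example = """chr1 197195431
chr2 181748086
chr3 159599782
"""
chrom_lengths = parse_chrom_lengths(example)
```

A malformed line raises ValueError with its line number, so problems surface
before a long diffReps run starts.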
A NOTE ABOUT STATISTICAL TESTS
When you have biological replicates, Negative Binomial (NB) is the recommended
test for differential analysis. An exact NB test is implemented in diffReps.
Because the NB distribution models discrete count data and over-dispersion among
different samples, it appears to be an ideal model for ChIP-seq data. Many
studies to date have used a T-test on normalized counts for differential
analysis. However, this is sub-optimal because normalized counts are NOT Normally
distributed! As a result, detection power can be significantly degraded. Another
caveat about the T-test is that regions with very small counts may be picked up.
Those regions should never pass the cutoff because they don't have statistical
significance. The T-test ignores this fact because it simply treats the counts
as continuous values. I still provide the T-test in diffReps for comparison
purposes only.
If your experiment doesn't contain biological replicates, you can choose between
the G-test and the Chi-square test for differential analysis. They give similar
results, but the G-test is recommended and has gained popularity recently.
See "http://en.wikipedia.org/wiki/G-test" for an explanation. When either is
chosen, diffReps performs a goodness-of-fit test on the normalized counts of the
treatment and control groups.
You can also use the G-test or Chi-square test on data WITH biological
replicates. The incentive for doing this is higher sensitivity, at the risk of
more false positives. diffReps automatically combines the biological replicates
and generates a probability vector accordingly. That means, if you have TWO
replicates for the treatment group and THREE replicates for the control group,
the probability vector will be adjusted to reflect the difference in replicate
number.
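To make the idea of a replicate-adjusted probability vector concrete, here is a
minimal sketch of a two-category G-test. It is an illustration only, not the
diffReps code: it assumes the expected proportions are simply the replicate
fractions (2/5 and 3/5 for two treatment and three control replicates), it uses
pooled counts for one window, and the function names and example counts are
hypothetical.

```python
import math


def replicate_probabilities(n_treatment, n_control):
    """Expected share of reads per group under 'no difference'.

    Assumption: the share is proportional to the number of replicates.
    """
    total = n_treatment + n_control
    return n_treatment / total, n_control / total


def g_test(observed, probs):
    """Two-category G-test of goodness of fit; returns (G, p-value)."""
    if len(observed) != 2 or len(probs) != 2:
        raise ValueError("this sketch handles exactly two categories")
    total = sum(observed)
    g = 0.0
    for obs, p in zip(observed, probs):
        if obs > 0:  # a zero count contributes nothing to G
            g += 2.0 * obs * math.log(obs / (total * p))
    g = max(g, 0.0)  # guard against tiny negative rounding error
    # With one degree of freedom, the chi-square upper tail is erfc(sqrt(G/2)).
    return g, math.erfc(math.sqrt(g / 2.0))


# Hypothetical pooled counts in one window: 2 treatment vs. 3 control replicates.
probs = replicate_probabilities(2, 3)
g_stat, p_value = g_test((120, 90), probs)
```

Without replicates, the same function can be called with probs = (0.5, 0.5).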
DIFFERENTIAL SITES ANNOTATION
diffReps includes a script for annotating a list of differential sites. By
default, it is invoked after diffReps finishes running and annotates the
differential sites based on their locations relative to the nearest genes. If no
nearby genes can be found, it also associates the differential sites with
heterochromatic regions. A differential site is assigned to one of the following
categories:
ProximalPromoter: +/- 250bp of TSS
Promoter1k: +/- 1kbp of TSS
Promoter3k: +/- 3kbp of TSS
Genebody: Anywhere between a gene's promoter and up to 1kbp downstream
of the TES.
Genedeserts: Genomic regions that are depleted of genes and are at least
1Mbp long.
Pericentromere: Between the boundary of a centromere and the closest gene minus
10kbp of that gene's regulatory region.
Subtelomere: Defined similarly to Pericentromere.
OtherIntergenic: Any region that does not belong to the above categories.
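The three promoter categories above are nested distance bands around the TSS.
The sketch below shows one way to resolve them. It is not the logic of
region_analysis.pl: it assumes a site is represented by a single coordinate
(for example its center) and that the narrowest matching band wins. The
function name promoter_category is hypothetical.

```python
# Distance bands from the category list above, narrowest first.
PROMOTER_BANDS = (
    ("ProximalPromoter", 250),
    ("Promoter1k", 1000),
    ("Promoter3k", 3000),
)


def promoter_category(site_position, tss_position):
    """Return the narrowest promoter band containing the site, or None.

    Assumption: both arguments are coordinates on the same chromosome.
    """
    distance = abs(site_position - tss_position)
    for name, max_distance in PROMOTER_BANDS:
        if distance <= max_distance:
            return name
    return None  # outside +/- 3kbp; other categories would apply
```

For example, a site 800bp upstream of a TSS falls in Promoter1k, while a site
5kbp away returns None and would need the gene body or intergenic rules.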
The script can also be run manually. For example, if you want to annotate
a differential list diff.h3k4me3.txt, you can use a command like:
region_analysis.pl -i diff.h3k4me3.txt -r -d refseq -g mm9
This annotates the list using the mm9 reference genome and the RefSeq database.
The output is written to diff.h3k4me3.txt.annotated.
FINDING HISTONE MODIFICATION HOTSPOTS
The distance between two adjacent differential sites can be approximated by a
Poisson distribution if the sites were positioned by random allocation. In
reality, differential sites are often found to be spatially clustered,
forming so-called chromatin modification hotspots. diffReps finds the hotspots
by first building a null model on site-to-site distance, and then looking for
regions that violate the null model with statistical significance using greedy
search. diffReps reports the start, end positions as well as associated p-values
and FDR of the hotspots. In addition, diffReps can accept more than one
differential list of different histone marks as input, so that one can predict
hotspots that show interaction between two or more histone marks.
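The intuition behind testing against a random-placement null can be sketched
with a simpler, related calculation. Note that this swaps in a different
technique from the one described above: instead of modeling site-to-site
distance and running a greedy search, it asks how surprising it is to see k or
more sites in a region of a given length if sites were scattered uniformly over
the genome. All function names and numbers are hypothetical.

```python
import math


def poisson_upper_tail(k, mu):
    """P(X >= k) for X ~ Poisson(mu), summed directly from the k-th term."""
    if k <= 0:
        return 1.0
    if mu <= 0:
        return 0.0
    if mu > 700:
        raise ValueError("mu too large for this simple sketch")
    # Log-space start avoids overflow in mu**k / k!.
    term = math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))
    total = 0.0
    i = k
    # Keep adding terms until they stop mattering; terms shrink once i > mu.
    while term > 0.0 and (i <= mu or term > total * 1e-16):
        total += term
        i += 1
        term *= mu / i
    return min(total, 1.0)


def cluster_p_value(sites_in_region, region_length, total_sites, genome_length):
    """Chance of at least this many sites in the region under uniform placement."""
    expected = total_sites * region_length / genome_length
    return poisson_upper_tail(sites_in_region, expected)


# Hypothetical: 8 of 2000 sites fall within one 50kbp stretch of a 2.6Gbp genome.
p = cluster_p_value(8, 50_000, 2_000, 2_600_000_000)
```

A very small p here means the sites are packed far more tightly than uniform
placement would predict, which is the property a hotspot is meant to capture.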
By default, the hotspot-finding routine is called after diffReps finishes
detecting differential sites. It tries to identify hotspots from the
differential list just generated. The routine can also be used separately. For
example, if you have two differential lists named diff.h3k4me3.txt and
diff.polII.txt, you can look for hotspots that represent the interaction between
H3K4me3 and Pol II using a command like:
findHotspots.pl -d diff.h3k4me3.txt diff.polII.txt -o hotspot_k4.pol2.txt
This generates a hotspot list in the file "hotspot_k4.pol2.txt".
EXAMPLES
diffReps requires BED files of ChIP-seq alignments for both the treatment
and control groups. BED files can be converted from any alignment format, such
as BAM (tip: you can use BEDTools for this). An example of using diffReps for
differential analysis is as follows:
diffReps.pl -tr C1.bed C2.bed C3.bed -co S1.bed S2.bed S3.bed -gn mm9 \
-re diff.nb.txt -me nb
The output is written to the file diff.nb.txt. By default, a sliding window of
1kbp is used with a moving step size of 100bp. There are other parameters that
can be tuned for your data. Type diffReps.pl in a command console without any
arguments and press Enter to see a usage summary.
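To visualize what a 1kbp window with a 100bp step means, the sketch below
enumerates windows along one chromosome and counts read start positions in
each. It is an illustration under stated assumptions, not the diffReps
implementation: coordinates are 0-based and half-open as in BED, only full-size
windows are generated (a trailing partial window is ignored), and the function
names are hypothetical.

```python
from bisect import bisect_left


def sliding_windows(chrom_length, window=1000, step=100):
    """Yield (start, end) for each window along a chromosome."""
    last_start = max(chrom_length - window, 0)
    for start in range(0, last_start + 1, step):
        # min() only matters when the chromosome is shorter than one window.
        yield start, min(start + window, chrom_length)


def window_counts(read_starts, chrom_length, window=1000, step=100):
    """Count reads whose start falls in each window: (start, end, count)."""
    positions = sorted(read_starts)
    counts = []
    for start, end in sliding_windows(chrom_length, window, step):
        n = bisect_left(positions, end) - bisect_left(positions, start)
        counts.append((start, end, n))
    return counts


# Hypothetical read start positions on a tiny 1200bp chromosome.
counts = window_counts([50, 150, 1150], chrom_length=1200)
```

With these defaults, a chromosome of 197195431bp (chr1 in the example length
file) yields about 2 million windows, which is why a whole-genome scan involves
millions of tests.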
RUNNING TIME
The running time of diffReps depends heavily on the data and parameter settings.
It can vary widely, from 30min to 10h. The parameters with the most influence on
running time are window size and step size. The smaller the window size, the
longer the running time, which scales linearly.