forked from szpiech/selscan
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME
372 lines (247 loc) · 14.4 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
selscan -- a program to calculate EHH-based scans for positive selection in
genomes
Copyright (C) 2014 Zachary A Szpiech
selscan currently implements EHH, iHS, XP-EHH, and nSL, and requires phased data.
It should be run separately for each chromosome and population (or population
pair for XP-EHH). selscan is 'dumb' with respect ancestral/derived coding and
simply expects haplotype data to be coded 0/1. Unstandardized iHS scores are
thus reported as ln(iHH1/iHH0) based on the coding you have provided.
Citations:
ZA Szpiech and RD Hernandez (2014) selscan: an efficient multi-threaded program
to calculate EHH-based scans for positive selection. Molecular Biology and Evolution,
31: 2824-2827.
A Ferrer-Admetlla, et al. (2014) On detecting incomplete soft or hard selective sweeps
using haplotype structure. Molecular Biology and Evolution, 31: 1275-1291.
PC Sabeti, et al. (2007) Genome-wide detection and characterization of positive
selection in human populations. Nature, 449: 913–918.
BF Voight, et al. (2006) A map of recent positive selection in the human
genome. PLoS Biology, 4: e72.
PC Sabeti, et al. (2002) Detecting recent positive selection in the human
genome from haplotype structure. Nature, 419: 832–837.
28OCT2015 - Updates to norm so that it can handle output from selscan when --ihs-detail is used.
18JUNE2015 - v1.1.0a - When calculating nSL, a mapfile is no longer required for VCF. Physical distances will be read directly from VCF. A mapfile specifying physical distances is stille required for .hap files when calculating nSL. selscan now appropriately reports an error if this is not provided.
15JUNE2015 - Release of 1.1.0. tomkinsc adds the --ihs-detail parameter which, when provided as an adjunct to --ihs, will cause selscan to write out four additional columns to the output file of iHS calculations (in order): derived_ihh_left, derived_ihh_right, ancestral_ihh_left, and ancestral_ihh_right.
An example file row follows, with header added for clarity.
locus phys-pos 1_freq ihh_1 ihh_0 ihs derived_ihh_left derived_ihh_right ancestral_ihh_left ancestral_ihh_right
16133705 16133705 0.873626 0.0961264 0.105545 -0.0934761 0.0505176 0.0456087 0.0539295 0.0516158
From these values we can calculate iHS, but it is preserved in the output for convenience. Having left and right integral information may assist certain machine learning models that gain information from iHH asymmetry.
selscan can now calculate the nSL statistic described in A Ferrer-Admetlla, et al. (2014) MBE, 31: 1275-1291. Also introduced a check on map distance ordering. Three new command line options.
--nsl <bool>: Set this flag to calculate nSL.
Default: false
--max-extend-nsl <int>: The maximum distance an nSL haplotype is allowed to extend from the core.
Set <= 0 for no restriction.
Default: 100
--ihs-detail <bool> : Set this flag to write out left and right iHH scores for '1' and '0' in addition to iHS.
06MAY2015 - Release of 1.0.5. Added basic VCF support. selscan can now read .vcf and .vcf.gz files but without tabix support. A mapfile is required when using VCF. Two new command line options.
13MAY2015 - norm v1.0.5 is released. norm will now normalize ihs or xpehh scores. Two new command line options.
--ihs <bool>: Do iHS normalization.
--xpehh <bool>: Do XP-EHH normalization.
Exactly one of these must be specified when running norm (e.g. ./norm --ihs --files *.ihs.out or ./norm --xpehh --files *.xpehh.out).
--vcf <string>: A VCF file containing haplotype data.
A map file must be specified with --map.
--vcf-ref <string>: A VCF file containing haplotype and map data.
Variants should be coded 0/1. This is the 'reference'
population for XP-EHH calculations and should contain the same number
of loci as the query population. Ignored otherwise.
07JAN2015 - norm bug fix and --skip-low-freq works for single EHH queries.
12NOV2014 - The program norm has been updated to allow for user defined critical values. Two new command line options.
--crit-percent <double>: Set the critical value such that a SNP with iHS in the most extreme CRIT_PERCENT tails (two-tailed) is marked as an extreme SNP.
Not used by default.
--crit-val <double>: Set the critical value such that a SNP with |iHS| > CRIT_VAL is marked as an extreme SNP. Default as in Voight et al.
Default: 2.00
17OCT2014 - Release of 1.0.4. A pairwise sequence difference module has been introduced. This module isn't multithreaded at the moment, but still runs quite fast. Calculating pi in 100bp windows with 198 haplotypes with 707,980 variants on human chr22 finishes in 77s on the test machine. Using 100kb windows, it finishes in 34s. Two new command line options.
--pi <bool>: Set this flag to calculate mean pairwise sequence difference in a sliding window.
Default: false
--pi-win <int>: Sliding window size in bp for calculating pi.
Default: 100
15SEP2014 - Release of 1.0.3. **A critical bug in the XP-EHH module was introduced in version 1.0.2 and had been fixed in 1.0.3. Do not use 1.0.2 for calculating XP-EHH scores.** Thanks to David McWilliams for finding this error. 1.0.3 also introduces support for gzipped input files. You may pass hap.gz, map.gz. and tped.gz files interchangably with unzipped files using the same command line arguments. A new command line option is available.
--trunc-ok <bool>: If an EHH decay reaches the end of a sequence before reaching the cutoff,
integrate the curve anyway (iHS and XPEHH only).
Normal function is to disregard the score for that core.
Default: false
17JUN2014 - Release of 1.0.2. General speed improvements have been made, especially with threading. New support for TPED formatted data and new command line options are available.
--skip-low-freq <bool>: Do not include low frequency variants in the construction of haplotypes (iHS only).
Default: false
--max-extend: The maximum distance an EHH decay curve is allowed to extend from the core.
Set <= 0 for no restriction.
Default: 1000000
--tped <string>: A TPED file containing haplotype and map data.
Variants should be coded 0/1
Default: __hapfile1
--tped-ref <string>: A TPED file containing haplotype and map data.
Variants should be coded 0/1. This is the 'reference'
population for XP-EHH calculations and should contain the same number
of loci as the query population. Ignored otherwise.
Default: __hapfile2
10APR2014 - Release of 1.0.1. Minor bug fixes. XP-EHH output header is now separated by tabs instead of spaces. Removed references to missing data (which is not accepted), and introduced error checking in the event of non-0/1 data being provided.
26MAR2014 - Initial release of selscan 1.0.0.
USAGE:
**Data must be phased and have no missing genotypes.**
To calculate EHH:
./selscan --ehh <locusID> --hap <haplotypes> --map <mapfile> --out <outfile>
Output file:
<outfile>.ehh.<locusID>[.alt].out
Format:
<physicalPos> <geneticPos> <'1' EHH> <'0' EHH> <full sample EHH>
The first two columns give the distance from the core locus. When these values are
negative, the haplotype extends upstream.
To calculate iHS:
./selscan --ihs --hap <haplotypes> --map <mapfile> --out <outfile>
Output file:
<outfile>.ihs[.alt].out
Format:
<locusID> <physicalPos> <'1' freq> <ihh1> <ihh0> <unstandardized iHS>
To calculate XP-EHH:
./selscan --xpehh --hap <pop1 haplotypes> --ref <pop2 haplotypes> --map <mapfile> --out <outfile>
Output file:
<outfile>.xpehh[.alt].out
Format:
<locusID> <physicalPos> <geneticPos> <popA '1' freq> <ihhA> <popB '1' freq> <ihhB> <unstandardized XPEHH>
To calculate nSL:
./selscan --nsl --hap <haplotypes> --map <mapfile> --out <outfile>
Output file:
<outfile>.nsl[.alt].out
Format:
<locusID> <physical position> <'1' freq> <sL1> <sL0> <unstandardized nSL>
EXAMPLES:
selscan --ihs --hap example.hap --map example.map --out example
selscan --ehh Locus4077 --hap example.hap --map example.map --out example
selscan --xpehh --hap example2.pop1.hap --ref example2.pop2.hap --map example2.map --out example2
selscan --nsl --hap example.hap --map example.map --out example
----------Command Line Arguments----------
--alt <bool>: Set this flag to calculate homozygosity based on the sum of the
squared haplotype frequencies in the observed data instead of using
binomial coefficients.
Default: false
--cutoff <double>: The EHH decay cutoff.
Default: 0.05
--ehh <string>: Calculate EHH of the '1' and '0' haplotypes at the specified
locus. Output: <physical dist> <genetic dist> <'1' EHH> <'0' EHH>
Default: __NO_LOCUS__
--ehh-win <int>: When calculating EHH, this is the length of the window in bp
in each direction from the query locus.
Default: 100000
--gap-scale <int>: Gap scale parameter in bp. If a gap is encountered between
two snps > GAP_SCALE and < MAX_GAP, then the genetic distance is
scaled by GAP_SCALE/GAP.
Default: 20000
--hap <string>: A hapfile with one row per haplotype, and one column per
variant. Variants should be coded 0/1
Default: __hapfile1
--help <bool>: Prints this help dialog.
Default: false
--ihs <bool>: Set this flag to calculate iHS.
Default: false
--ihs-detail <bool>: Set this flag to write out left and right iHH scores for '1' and '0' in addition to iHS.
Default: false
--maf <double>: If a site has a MAF below this value, the program will not use
it as a core snp.
Default: 0.05
--map <string>: A mapfile with one row per variant site.
Formatted <chr#> <locusID> <genetic pos> <physical pos>.
Default: __mapfile
--max-extend <int>: The maximum distance an EHH decay curve is allowed to extend from the core.
Set <= 0 for no restriction.
Default: 1000000
--max-extend-nsl <int>: The maximum distance an nSL haplotype is allowed to extend from the core.
Set <= 0 for no restriction.
Default: 100
--max-gap <int>: Maximum allowed gap in bp between two snps.
Default: 200000
--nsl <bool>: Set this flag to calculate nSL.
Default: false
--out <string>: The basename for all output files.
Default: outfile
--pi <bool>: Set this flag to calculate mean pairwise sequence difference in a sliding window.
Default: false
--pi-win <int>: Sliding window size in bp for calculating pi.
Default: 100
--ref <string>: A hapfile with one row per haplotype, and one column per
variant. Variants should be coded 0/1. This is the 'reference'
population for XP-EHH calculations. Ignored otherwise.
Default: __hapfile2
--skip-low-freq <bool>: Do not include low frequency variants in the construction of haplotypes (iHS only).
Default: false
--threads <int>: The number of threads to spawn during the calculation.
Partitions loci across threads.
Default: 1
--tped <string>: A TPED file containing haplotype and map data.
Variants should be coded 0/1
Default: __hapfile1
--tped-ref <string>: A TPED file containing haplotype and map data.
Variants should be coded 0/1. This is the 'reference'
population for XP-EHH calculations and should contain the same number
of loci as the query population. Ignored otherwise.
Default: __hapfile2
--trunc-ok <bool>: If an EHH decay reaches the end of a sequence before reaching the cutoff,
integrate the curve anyway (iHS and XPEHH only).
Normal function is to disregard the score for that core.
Default: false
--vcf <string>: A VCF file containing haplotype data.
A map file must be specified with --map.
Default: __hapfile1
--vcf-ref <string>: A VCF file containing haplotype and map data.
Variants should be coded 0/1. This is the 'reference'
population for XP-EHH calculations and should contain the same number
of loci as the query population. Ignored otherwise.
Default: __hapfile2
--xpehh <bool>: Set this flag to calculate XP-EHH.
Default: false
################################################################################
################################################################################
**NOTE that nSL output files can be normalized with the --ihs flag
norm v1.1.0a -- a program for downstream analysis of selscan output
Source code and binaries can be found at
<https://www.github.com/szpiech/selscan>
Citations:
ZA Szpiech and RD Hernandez (2014) MBE, 31: 2824-2827.
PC Sabeti, et al. (2007) Nature, 449: 913–918.
BF Voight, et al. (2006) PLoS Biology, 4: e72.
To normalize iHS output across frequency bins:
./norm --ihs --files <file1.ihs.out> ... <fileN.ihs.out>
To normalize nSL output across frequency bins:
./norm --ihs --files <file1.nsl.out> ... <fileN.nsl.out>
To normalize iHS output and analyze non-overlapping windows of fixed bp for
extreme iHS scores:
./norm --ihs --files <file1.ihs.out> ... <fileN.ihs.out> --bp-win
To normalize XP-EHH output across frequency bins:
./norm --xpehh --files <file1.xpehh.out> ... <fileN.xpehh.out>
To normalize XP-EHH output and analyze non-overlapping windows of fixed bp for
extreme XP-EHH scores:
./norm --xpehh --files <file1.xpehh.out> ... <fileN.xpehh.out> --bp-win
----------Command Line Arguments----------
--bins <int>: The number of frequency bins in [0,1] for score normalization.
Default: 100
--bp-win <bool>: If set, will use windows of a constant bp size with varying
number of SNPs.
Default: false
--crit-percent <double>: Set the critical value such that a SNP with iHS in the most extreme CRIT_PERCENT tails (two-tailed) is marked as an extreme SNP.
Not used by default.
Default: -1.00
--crit-val <double>: Set the critical value such that a SNP with |iHS| > CRIT_VAL is marked as an extreme SNP. Default as in Voight et al.
Default: 2.00
--files <string1> ... <stringN>: A list of files delimited by whitespace for
joint normalization.
Expected format for iHS files (no header):
<locus name> <physical pos> <freq> <ihh1> <ihh2> <ihs>
Expected format for XP-EHH files (one line header):
<locus name> <physical pos> <genetic pos> <freq1> <ihh1> <freq2> <ihh2> <xpehh>
Default: infile
--first <bool>: Output only the first file's normalization.
Default: false
--help <bool>: Prints this help dialog.
Default: false
--ihs <bool>: Do iHS normalization.
Default: false
--log <string>: The log file name.
Default: logfile
--min-snps <int>: Only consider a bp window if it has at least this many SNPs.
Default: 10
--qbins <int>: Outlying windows are binned by number of sites within each
window. This is the number of quantile bins to use.
Default: 20
--winsize <int>: The non-overlapping window size for calculating the percentage
of extreme SNPs.
Default: 100000
--xpehh <bool>: Do XP-EHH normalization.
Default: false