diff --git a/README.md b/README.md
index 3a6275c..fc2217a 100755
--- a/README.md
+++ b/README.md
@@ -24,6 +24,14 @@ in the `src` directory. The executables will be in `build`. For debugging with g
 To report bugs, use the [issue](https://github.com/andrej-fischer/cloneHD/issues) interface of github.
 
+# Full documentation
+
+The full documentation can be found in the `/docs/` subfolder. Click below.
+
+* [pre-filter](/docs/README-pre-filter.md)
+* [filterHD](/docs/README-filterHD.md)
+* [cloneHD](/docs/README-cloneHD.md)
+
 # What are cloneHD and filterHD for?
 
 cloneHD is a software for reconstructing the subclonal structure of a
@@ -59,20 +67,15 @@ prediction (red). (vii) The observed SNV frequencies, corrected for
 local ploidy, and per genotype (SNVs are assigned ramdomly according
 to the cloneHD SNV posterior). (All plots are created with Wolfram
 [Mathematica](http://www.wolfram.com/mathematica/).)
 
-# Full documentation
-
-The full documentation can be found in the `/docs/` subfolder. Click below.
-
-* [pre-filter](/docs/README-pre-filter.md)
-* [filterHD](/docs/README-filterHD.md)
-* [cloneHD](/docs/README-cloneHD.md)
-
 # Tips and tricks
 
+* Note: all input files are assumed to be sorted by genomic coordinate. With Unix, this
+  can be guaranteed with `sort -k1n,1 -k2n,2 file.txt > sorted-file.txt`.
+
 * Pre-filtering of data can be very important. If filterHD predicts
   many more jumps than you would expect, it might be necessary to
   pre-filter the data, removing variable regions, outliers or very short
-  segments (use programs `pre-filter` and `filterHD`).
+  segments (use the programs `pre-filter` and `filterHD`).
 
 * Make sure that the bias field for the tumor CNA data is
   meaningful. If a matched normal sample was sequenced with the same
@@ -80,14 +83,17 @@ The full documentation can be found in the `/docs/` subfolder. Click below.
   bias field for the tumor CNA data. Follow the logic of the example
   given here.
-* If the matched-normal sample was sequenced at lower coverage than the tumor sample, it might be necessary to run filterHD with a higher-than-optimal diffusion constant (set with `--sigma [double]`) to obtain a more faithful bias field. Otherwise, the filterHD solution is too stiff and you loose bias detail.
+* If the matched-normal sample was sequenced at lower coverage than the tumor sample,
+  it might be necessary to run filterHD with a higher-than-optimal diffusion constant
+  (set with `--sigma [double]`) to obtain a more faithful bias field. Otherwise, the
+  filterHD solution is too stiff and you lose bias detail.
 
 * filterHD can sometimes run into local optima. In this case, it might
   be useful to set initial values for the parameters via `--jumpi [double]` etc.
 
-* By default, cloneHD runs with mass gauging enabled. This seems like
-  an overkill, but is actually quite useful because you can see some
-  alternative explanations during the course of the analysis.
+* By default, cloneHD runs with mass-gauging enabled. This seems wasteful,
+  but is actually quite useful because you can see some alternative explanations
+  during the course of the analysis.
 
 * Don't put too much weight on the BIC criterion. It was calibrated
   using simulated data. For real data, it should be supplemented with
@@ -95,8 +101,10 @@ The full documentation can be found in the `/docs/` subfolder. Click below.
   fixed number of subclones and `--max-tcn [int]` to set the maximum
   possible total copy number.
 
-* For exome sequencing data, the read depth bias can be enormous. The filterHD estimate
-  of the bias field might not be useful, especially in segmenting the data.
-  Use rather, if available, the jumps seen in the BAF data for both CNA and BAF.
-
+* If high copy numbers are expected only in a few chromosomes, you can increase performance
+  by using the `--max-tcn [file]` option to specify per-chromosome upper limits.
+
+* For exome sequencing data, the read depth bias can be enormous.
+  The filterHD estimate of the bias field might not be very useful, especially
+  in segmenting the tumor data. If available, use the jumps seen in the BAF data
+  for both CNA and BAF (give the BAF jumps file to both `--cna-jumps` and `--baf-jumps`).
diff --git a/changelog.md b/changelog.md
index 008fcc4..234f1a9 100644
--- a/changelog.md
+++ b/changelog.md
@@ -1,5 +1,11 @@
 # changelog for cloneHD/filterHD
 
+## v1.17.6 / to come
+
+* fixed nan bug in GOF, when N==0 (missing data)
+* fixed bugs in `pre-filter`, when `--window-size` is greater than the data length
+* fixed bug in `pre-filter` in pick-from-match-to mode
+
 ## v1.17.5 / 22.04.2014
 
 * fixed memory alloc bug in pre-filter
diff --git a/docs/README-pre-filter.md b/docs/README-pre-filter.md
index 2e550d5..b1ff8e0 100755
--- a/docs/README-pre-filter.md
+++ b/docs/README-pre-filter.md
@@ -4,34 +4,39 @@ The program `pre-filter` can be used to remove loci based on the observed read depth.
 
 ## Typical usage options
 
-* `--data [file]` Input data to be pre-filtered.
+* `--data [file]` Input data to be pre-filtered.
+
+  The file format is the same as for cloneHD's `--cna` option. Only the first sample will be used for pre-filtering.
 
-  The file format is the same as for cloneHD's `--cna` option. Only the first sample will be used for pre-filtering.
+* `--pre [string:"./out"]` Set prefix for all output files.
 
-* `--pre [string:"./out"]` Set prefix for all output files.
+  The pre-filtered loci and data are printed to a file named `pre.pref.txt`.
 
-  The pre-filtered loci and data are print to a file named `pre.pref.txt`.
-
-* `--print-tracks [0/1:0]` Print the window average and window variability.
+* `--print-tracks [0/1:0]` Print the window average and window variability.
 
   The windowed tracks are used for pre-filtering. They are printed for all loci to a file named `pre.track.txt`. Use this to inspect and tune the pre-filter thresholds.
-* `--pick-from [file]` Pre-filter data in this file by picking loci present in `match-to`.
+* `--pick-from [file]` Pre-filter data in this file by picking loci present in `match-to`.
+
+  Only those loci are selected which fall into a bin also present in `match-to`. Bins in `match-to` are assumed to be of constant width, with the given coordinate being the right bin end, inclusive, e.g.
 
-  Only loci are selected which fall into a bin also present in `match-to`. If bins in `match-to` are in kb (1000, 2000 etc., i.e. 1-1000, 1001-2000 etc.) then loci in `pick-from` are rounded to the next 1000 above.
+      1000 = 1-1000
+      2000 = 1001-2000
+      4000 = 3001-4000
+      etc.
 
-* `--match-to [file]` Use this file as reference to pick loci in `pick-from`.
+* `--match-to [file]` Use this file as reference to pick loci in `pick-from`.
 
   Loci in this file are assumed to be equidistant (e.g. per 1 kb, not all bins need be present). The bin width is decided automatically by majority.
 
 ## Parameter options
 
-* `--window-size [int:100]` Set the window scale for smoothing (centered, +-size).
+* `--window-size [int:100]` Set the window scale for smoothing (centered, +-size).
 
-* `--remove-outlier [double:3.0]` Set the outlier threshold.
+* `--remove-outlier [double:3.0]` Set the outlier threshold.
 
   All loci are removed where the observed read depth is further than this value away from the local window-average (in units of sqrt(window-average), assuming Poisson-distributed read depths). If set to `0.0`, this filter is not applied.
 
-* `--remove-variable [double:2.0]` Set the variability threshold.
+* `--remove-variable [double:2.0]` Set the variability threshold.
 
  All loci are removed where the local window-variability exceeds this multiple of the global variability. Global (local) variability is defined as the median (mean) of the absolute distance of observed read depths from the global median read depth. If set to `0.0`, this filter is not applied.
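Editor's note: the `--remove-outlier` rule documented above can be sketched as follows. This is a minimal illustration with hypothetical names (`outlier_mask` is not in the cloneHD source): a locus is masked out when its read-depth ratio deviates from the local window average by more than the threshold times sqrt(window-average), the Poisson noise scale.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the --remove-outlier rule (not the actual
// pre-filter source): mask[l] = 0 when
//   |x[l] - wMean[l]| > thresh * sqrt(wMean[l]),
// i.e. the deviation exceeds `thresh` Poisson standard deviations.
// thresh == 0.0 disables the filter, as stated in the docs above.
std::vector<int> outlier_mask(const std::vector<double>& x,
                              const std::vector<double>& wMean,
                              double thresh){
  std::vector<int> mask(x.size(), 1);
  if (thresh == 0.0) return mask; // filter not applied
  for (std::size_t l = 0; l < x.size(); ++l){
    if (std::fabs(x[l] - wMean[l]) > thresh * std::sqrt(wMean[l]))
      mask[l] = 0; // outlier: remove this locus
  }
  return mask;
}
```

With a window average of 100 and the default threshold 3.0, only loci deviating by more than 30 reads are removed.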
diff --git a/src/clone-fwd-bwd.cpp b/src/clone-fwd-bwd.cpp
index ebc751b..561d298 100644
--- a/src/clone-fwd-bwd.cpp
+++ b/src/clone-fwd-bwd.cpp
@@ -596,7 +596,8 @@ void Clone::get_cna_gof(gsl_vector * post, int s, int evt){
   for (int t=0; t<nTimes; t++){
     n = cnaEmit->reads[t][s][idx];
     N = cnaEmit->depths[t][s][idx];
-    xobs = double(n)/double(N);
+    if (N==0) continue;
+    xobs = double(n) / double(N);
     g = 0;
     for (int l=0; ldata[t];
@@ -623,6 +624,7 @@ void Clone::get_baf_gof(gsl_vector * post, int s, int evt){
   for (int t=0; t<nTimes; t++){
     n = bafEmit->reads[t][s][idx];
     N = bafEmit->depths[t][s][idx];
+    if (N==0) continue;
     xobs = double(n)/double(N);
     xobs = min(xobs,1.0-xobs);
     g = 0;
@@ -651,6 +653,7 @@ void Clone::get_snv_gof(gsl_vector * post, int s, int evt){
   for (int t=0; t<nTimes; t++){
     n = snvEmit->reads[t][s][idx];
     N = snvEmit->depths[t][s][idx];
+    if (N==0) continue;
     xobs = double(n)/double(N);
     mntcn = (snvEmit->mean_tcn == NULL)
diff --git a/src/cloneHD-functions.cpp b/src/cloneHD-functions.cpp
index 7830de6..e200c95 100755
--- a/src/cloneHD-functions.cpp
+++ b/src/cloneHD-functions.cpp
@@ -691,6 +691,10 @@ void get_jump_probability( Clone * myClone, cmdl_opts& opts){
     }
   }
   // allocations, now that all events are fixed...
+  if (cnaEmit->is_set) cnaEmit->get_nObs();
+  if (bafEmit->is_set) bafEmit->get_nObs();
+  if (snvEmit->is_set) snvEmit->get_nObs();
+  //
   if (cnaEmit->is_set){
     cnaEmit->allocate_mean_tcn();
     if (bafEmit->is_set){
@@ -916,9 +920,14 @@ void print_all_results( Clone * myClone, cmdl_opts& opts){
   // CNA llhs/gofs per sample...
  fprintf(clonal_fp, "# cna-gof\n");
  for (int t=0; t<nTimes; t++){
-   fprintf(clonal_fp, …, myClone->cna_llhs[t],
-           myClone->cna_gofs[t] / double(cnaEmit->total_loci));
+   unsigned int total_nObs = 0;
+   for (int s=0; s<cnaEmit->nSamples; s++){
+     for (int evt=0; evt<cnaEmit->nEvents[s]; evt++){
+       total_nObs += cnaEmit->nObs[t][s][evt];
+     }
+   }
+   if (total_nObs==0 && myClone->cna_gofs[t] > 0) abort();
+   fprintf( clonal_fp, "%.4f\n", myClone->cna_gofs[t] / double(total_nObs));
  }
  if (cnaEmit->coarse_grained) print_gof( myClone, cnaEmit, opts);
  //
@@ -953,9 +962,14 @@ void print_all_results( Clone * myClone, cmdl_opts& opts){
  //llh and gof per sample
  fprintf(clonal_fp, "# baf-gof\n");
  for (int t=0; t<nTimes; t++){
-   fprintf(clonal_fp, …, myClone->baf_llhs[t],
-           myClone->baf_gofs[t] / double(bafEmit->total_loci));
+   unsigned int total_nObs = 0;
+   for (int s=0; s<bafEmit->nSamples; s++){
+     for (int evt=0; evt<bafEmit->nEvents[s]; evt++){
+       total_nObs += bafEmit->nObs[t][s][evt];
+     }
+   }
+   if (total_nObs==0 && myClone->baf_gofs[t] > 0) abort();
+   fprintf(clonal_fp, "%.4f\n", myClone->baf_gofs[t] / double(total_nObs));
  }
  if (bafEmit->coarse_grained) print_gof( myClone, bafEmit, opts);
 }
@@ -991,9 +1005,14 @@ void print_all_results( Clone * myClone, cmdl_opts& opts){
  delete [] myClone->alpha_snv;
  fprintf(clonal_fp, "# snv-gof\n");
  for (int t=0; t<nTimes; t++){
-   fprintf(clonal_fp, …, myClone->snv_llhs[t],
-           myClone->snv_gofs[t] / double(snvEmit->total_loci));
+   unsigned int total_nObs = 0;
+   for (int s=0; s<snvEmit->nSamples; s++){
+     for (int evt=0; evt<snvEmit->nEvents[s]; evt++){
+       total_nObs += snvEmit->nObs[t][s][evt];
+     }
+   }
+   if (total_nObs==0 && myClone->snv_gofs[t] > 0) abort();
+   fprintf( clonal_fp, "%.4f\n", myClone->snv_gofs[t] / double(total_nObs));
  }
  if (snvEmit->coarse_grained) print_gof( myClone, snvEmit, opts);
 }
@@ -1039,15 +1058,21 @@ void print_all_results( Clone * myClone, cmdl_opts& opts){
  delete [] myClone->alpha_snv;
  fprintf(clonal_fp, "# snv-gof\n");
  for (int t=0; t<nTimes; t++){
-   fprintf(clonal_fp, …, myClone->snv_llhs[t],
-           myClone->snv_gofs[t] / double(snvEmit->total_loci));
+   unsigned int total_nObs = 0;
+   for (int s=0; s<snvEmit->nSamples; s++){
+     for (int evt=0; evt<snvEmit->nEvents[s]; evt++){
+       total_nObs +=
snvEmit->nObs[t][s][evt];
+     }
+   }
+   if (total_nObs==0 && myClone->snv_gofs[t] > 0) abort();
+   fprintf( clonal_fp, "%.4f\n", myClone->snv_gofs[t] / double(total_nObs));
  }
  if (snvEmit->coarse_grained) print_gof( myClone, snvEmit, opts);
 }
 //
-  fprintf(clonal_fp, "# cna-ent baf-ent snv-ent\n");
-  fprintf(clonal_fp, "%.4e %.4e %.4e\n", myClone->cna_total_ent, myClone->baf_total_ent, myClone->snv_total_ent);
+  fprintf( clonal_fp, "# cna-ent baf-ent snv-ent\n");
+  fprintf( clonal_fp, "%.4e %.4e %.4e\n",
+           myClone->cna_total_ent, myClone->baf_total_ent, myClone->snv_total_ent);
   fclose(clonal_fp);
   if (snv_utcn_fp != NULL) fclose(snv_utcn_fp);
   if (baf_utcn_fp != NULL) fclose(baf_utcn_fp);
@@ -1316,7 +1341,9 @@ void print_gof( Clone * myClone, Emission * myEmit, cmdl_opts& opts){
            myEmit->chr[s], myEmit->loci[s][first], last-first+1, myEmit->loci[s][last] );
   for (int t=0; t< myClone->nTimes; t++){
-    fprintf( gof_fp, " %12.2f", gofs[t][s][evt]);
+    fprintf( gof_fp, " %.4f",
+             gofs[t][s][evt] > 0 ? gofs[t][s][evt] / double(myEmit->nObs[t][s][evt]) : 0.0
+             );
   }
   fprintf( gof_fp,"\n");
 }
diff --git a/src/emission.cpp b/src/emission.cpp
index ef6ab58..ded2885 100755
--- a/src/emission.cpp
+++ b/src/emission.cpp
@@ -41,6 +41,7 @@ Emission::Emission(){
   coarse_grained = 0;
   idx_to_Event_mapped=0;
   connect=0;
+  nObs = NULL;
 }
@@ -259,7 +260,42 @@ void Emission::allocate_av_cn(int maxtcn){//only once!
   }
 }
-
+void Emission::get_nObs(){
+  if (nObs!=NULL) abort();
+  nObs = new unsigned int ** [nTimes];
+  for (int t=0; t<nTimes; t++){
+    …
+          if (…0) abort();
+          continue;
+        }
+        nObs[t][s][evt]++;
+        }
+      }
+    }
+  }
+}
 //Destructor
diff --git a/src/emission.h b/src/emission.h
index 8573a61..d74ef7e 100755
--- a/src/emission.h
+++ b/src/emission.h
@@ -74,6 +74,8 @@ class Emission{
   unsigned int ** loci;
   unsigned int ** mask;
   unsigned int ** dist;
+  unsigned int *** nObs;
+  void get_nObs();
   int * nSites;
   int * chr;
   std::set<int> chrs;
diff --git a/src/filterHD.cpp b/src/filterHD.cpp
index ea03577..e310be0 100755
--- a/src/filterHD.cpp
+++ b/src/filterHD.cpp
@@ -203,7 +203,7 @@ int main (int argc, const char * argv[]){
       jumps[s][l] *= exp(myJD.pnojump[s][l]);
     }
     //goodness of fit...
-    if (mask==NULL || mask[s][l] == 1){
+    if ((mask==NULL || mask[s][l] == 1) && myEmit.depths[t][s][l] > 0){
       xobs = double(myEmit.reads[t][s][l]) / double(myEmit.depths[t][s][l]);
       if (opts.reflect) xobs = min(xobs,1.0-xobs);
       double x=0,g=0,dg=0;
diff --git a/src/pre-filter.cpp b/src/pre-filter.cpp
index d8b57d4..7475749 100755
--- a/src/pre-filter.cpp
+++ b/src/pre-filter.cpp
@@ -197,22 +197,23 @@ void pre_filter( Emission& dataEmit, cmdl_opts& opts){
   }
   //***Pre-filtering of each chromosome in turn***
   for (int s=0; s < dataEmit.nSamples; s++){
+    int L = dataEmit.nSites[s];
     unsigned int * rds = dataEmit.reads[0][s];
     unsigned int * dps = dataEmit.depths[0][s];
     if (dataEmit.nSites[s] == 0) abort();
-    double * wMean = new double [dataEmit.nSites[s]];
-    double * wVar  = new double [dataEmit.nSites[s]];
-    int * mask     = new int [dataEmit.nSites[s]];
-    for (int l=0; l< dataEmit.nSites[s]; l++){
+    double * wMean = new double [L];
+    double * wVar  = new double [L];
+    int * mask     = new int [L];
+    for (int l=0; l 0.0){//***REMOVE LOCI ACCORDING TO VARIABILITY***
      //get median read depth
      vector<double> allx;
-     for (int l=0; l 0){
        allx.push_back(double(rds[l]) / double(dps[l]));
      }
      median = gsl_stats_quantile_from_sorted_data ( allx.data(), 1, allx.size(), 0.5);
      allx.clear();
      //get median absolute deviation from median
-     for (int l=0; l 0){
        allx.push_back( fabs(double(rds[l]) / double(dps[l]) - median) );
      }
@@ -232,83 +233,77 @@ void pre_filter( Emission& dataEmit, cmdl_opts& opts){
      //get local variablility and mask
      double sum = 0.0;
      int front=-1, size=0, center=-opts.wSize-1, back=-2*opts.wSize-1;
-     while (center < dataEmit.nSites[s]-1){
-       if ( front == dataEmit.nSites[s] ){
-         size--;
-       }
-       else if ( size < 2*opts.wSize+1 ){
-         size++;
-       }
-       while (front < dataEmit.nSites[s]-1){
+     while (center < L-1){
+       while (front < L){
          front++;
-         if (dps[front] > 0) break;
+         if (front==L || dps[front] > 0) break;
        }
-       if (front < dataEmit.nSites[s]){
+       if (front < L){
+         if (size < 2*opts.wSize+1) size++;
          sum += fabs(median - double(rds[front]) / double(dps[front]));
-       }
-       while (center < dataEmit.nSites[s]-1){
-         center++;
-         if (center < 0 || dps[center] > 0 ) break;
-       }
+       }
        if (back >= 0){
          sum -= fabs(median - double(rds[back]) / double(dps[back]));
+         if (front==L && size>1) size--;
        }
-       while (back < dataEmit.nSites[s]-1){
+       while (back < L){
          back++;
-         if ( back < 0 || dps[back] > 0 ) break;
+         if ( back<0 || back==L || dps[back] > 0 ) break;
+       }
+       while (center < L-1){
+         center++;
+         if (center < 0 || dps[center] > 0 ) break;
        }
        if (center >= 0){
          wVar[center] = sum / double(size);
        }
      }
      //apply filter
-     for (int l=0; l<dataEmit.nSites[s]; l++){
+     for (int l=0; l<L; l++){
        if (wVar[l] > opts.remVar * medAbsDev) mask[l] = 0;
      }
    }
    if (opts.remOut > 0.0){//***REMOVE LOCI ACCORDING TO OUTLIER***
      double sum = 0.0;
      int front=-1, size=0, center=-opts.wSize-1, back=-2*opts.wSize-1;
-     while (center < dataEmit.nSites[s]-1){
-       if ( front == dataEmit.nSites[s] ){
-         size--;
-       }
-       else if ( size < 2*opts.wSize+1 ){
-         size++;
-       }
-       while (front < dataEmit.nSites[s]-1){
+     while (center < L-1){
+       while (front < L){
          front++;
-         if (dps[front] > 0 && mask[front]==1) break;
+         if (front == L) break;
+         if (dps[front] > 0 && mask[front]==1) break;
        }
-       if (front < dataEmit.nSites[s]){
+       if (front < L){
+         if (size < 2*opts.wSize+1) size++;
          sum += double(rds[front]) / double(dps[front]);
        }
-       while (center < dataEmit.nSites[s]-1){
-         center++;
-         if (center < 0 || (dps[center] > 0 && mask[center]==1)) break;
-       }
        if (back >= 0){
          sum -= double(rds[back]) / double(dps[back]);
+         if (front==L && size>1) size--;
        }
-       while (back < dataEmit.nSites[s]-1){
+       while (back < L){
          back++;
-         if ( back<0 || (dps[back]>0 && mask[back]==1 ) ) break;
+         if (back==L || back<0) break;
+         if (dps[back]>0 && mask[back]==1) break;
+       }
+       while (center < L-1){
+         center++;
+         if (center<0 || (dps[center]>0 && mask[center]==1)) break;
        }
        if (center >= 0){
          wMean[center] = sum / double(size);
        }
      }
      //apply filter
-     for (int l=0; l<dataEmit.nSites[s]; l++){
-       if ( … > opts.remOut * sqrt(wMean[l]) ){
-         mask[l] = 0;
-       }
-     }
+     for (int l=0; l<L; l++){
+       if ( … > opts.remOut * sqrt(wMean[l]) ){
+         mask[l] = 0;
+       }
+     }
    }
    //print
-   for (int l=0; l 0) ? wVar[l] / medAbsDev : 0 );
    }
    //cleanup
@@ -342,16 +338,12 @@ void pick_match( Emission& pickEmit, Emission& matchEmit, cmdl_opts& opts){
      exit(1);
    }
    int matchSample = matchEmit.idx_of[chr];
-   //get all bins and determine bin width...
-   // NOTE: assumes constant width bins
-   std::set<int> bins;
+   // get all bins and determine bin width...
+   // !!!NOTE: assumes uniform bin width!!!
    std::map<int,int> diffs;
-   for (int l=0; l0){
-     int diff = (int) (matchEmit.loci[matchSample][l] - matchEmit.loci[matchSample][l-1]);
-     diffs[diff]++;
-   }
+   unsigned int * mloci = matchEmit.loci[matchSample];
+   for (int l=1; l::iterator it;
@@ -362,12 +354,14 @@ void pick_match( Emission& pickEmit, Emission& matchEmit, cmdl_opts& opts){
    //now pick...
-   for (int l=0; l
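Editor's note: the bin-matching rule that `pick_match` implements is stated in docs/README-pre-filter.md (constant-width bins, coordinate = right bin end inclusive, loci rounded up to the next bin end). As a hedged sketch, with `bin_end` a hypothetical helper for illustration rather than a function from the source, it amounts to:

```cpp
// Hypothetical illustration of the pick-from/match-to bin rule described
// in docs/README-pre-filter.md: with constant bin width w and
// right-inclusive bin ends (1000 = 1-1000, 2000 = 1001-2000, ...),
// a locus belongs to the bin whose coordinate is the next multiple of w
// at or above the locus.
unsigned int bin_end(unsigned int locus, unsigned int width){
  // integer round-up to a multiple of width
  return ((locus + width - 1) / width) * width;
}
```

A locus in `pick-from` is then kept if its bin end is also present in `match-to`; the bin width itself is inferred by majority vote over the coordinate differences in `match-to`, as in the `diffs` map above.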