Merge branch 'develop'
andrej-fischer committed Apr 24, 2014
2 parents 2ed799c + 1900b37 commit aabcd3a
Showing 9 changed files with 192 additions and 111 deletions.
42 changes: 25 additions & 17 deletions README.md
@@ -24,6 +24,14 @@ in the `src` directory. The executables will be in `build`. For debugging with g

To report bugs, use the [issue](https://github.com/andrej-fischer/cloneHD/issues) interface of github.

# Full documentation

The full documentation can be found in the `/docs/` subfolder. Click below.

* [pre-filter](/docs/README-pre-filter.md)
* [filterHD](/docs/README-filterHD.md)
* [cloneHD](/docs/README-cloneHD.md)

# What are cloneHD and filterHD for?

cloneHD is software for reconstructing the subclonal structure of a
@@ -59,44 +67,44 @@ prediction (red).
(vii) The observed SNV frequencies, corrected for local ploidy, and per genotype (SNVs are assigned randomly according to the cloneHD SNV posterior).
(All plots are created with Wolfram [Mathematica](http://www.wolfram.com/mathematica/).)

# Tips and tricks

* Note: all input files are assumed to be sorted by genomic coordinate. With Unix, this
can be guaranteed with `sort -k1n,1 -k2n,2 file.txt > sorted-file.txt`.

* Pre-filtering of data can be very important. If filterHD predicts
many more jumps than you would expect, it might be necessary to
pre-filter the data, removing variable regions, outliers or very short
segments (use the programs `pre-filter` and `filterHD`).

* Make sure that the bias field for the tumor CNA data is
meaningful. If a matched normal sample was sequenced with the same
pipeline, its read depth profile, as predicted by filterHD, can be used as a
bias field for the tumor CNA data. Follow the logic of the example
given here.

* If the matched-normal sample was sequenced at lower coverage than the tumor sample,
it might be necessary to run filterHD with a higher-than-optimal diffusion constant
(set with `--sigma [double]`) to obtain a more faithful bias field. Otherwise, the
filterHD solution is too stiff and you lose bias detail.

* filterHD can sometimes run into local optima. In this case, it might be useful to
set initial values for the parameters via `--jumpi [double]` etc.

* By default, cloneHD runs with mass gauging enabled. This seems wasteful,
but is actually quite useful because you can see some alternative explanations
during the course of the analysis.

* Don't put too much weight on the BIC criterion. It was calibrated
using simulated data. For real data, it should be supplemented with
common sense and biological knowledge. Use `--force [int]` to fix the
number of subclones and `--max-tcn [int]` to set the maximum possible total
copy number.

* If high copy numbers are expected only in a few chromosomes, you can increase performance
by using the `--max-tcn [file]` option to specify per-chromosome upper limits.

* For exome sequencing data, the read depth bias can be enormous. The filterHD estimate
of the bias field might not be very useful, especially in segmenting the tumor data.
If available, use instead the jumps seen in the BAF data for both CNA and BAF
(give the BAF jumps file to both `--cna-jumps` and `--baf-jumps`).
6 changes: 6 additions & 0 deletions changelog.md
@@ -1,5 +1,11 @@
# changelog for cloneHD/filterHD

## v1.17.6 / to come

* fixed NaN bug in GOF when N==0 (missing data)
* fixed bugs in `pre-filter` when `--window-size` is greater than the data length
* fixed bug in `pre-filter` in pick-from-match-to mode

## v1.17.5 / 22.04.2014

* fixed memory alloc bug in pre-filter
29 changes: 17 additions & 12 deletions docs/README-pre-filter.md
@@ -4,34 +4,39 @@ The program `pre-filter` can be used to remove loci based on the observed read d

## Typical usage options

* `--data [file]` Input data to be pre-filtered.

The file format is the same as for cloneHD's `--cna` option. Only the first sample will be used for pre-filtering.

* `--pre [string:"./out"]` Set prefix for all output files.

The pre-filtered loci and data are printed to a file named `pre.pref.txt`.

* `--print-tracks [0/1:0]` Print the window average and window variability.

The windowed tracks are used for pre-filtering. They are printed for all loci to a file named `pre.track.txt`. Use this to inspect and tune the pre-filter thresholds.

* `--pick-from [file]` Pre-filter data in this file by picking loci present in `match-to`.

Only loci which fall into a bin also present in `match-to` are selected. Bins in `match-to` are assumed to be of constant width, with the given coordinate being the right bin end (inclusive), e.g.

1000 = 1-1000
2000 = 1001-2000
4000 = 3001-4000
etc.

* `--match-to [file]` Use this file as reference to pick loci in `pick-from`.

Loci in this file are assumed to be equidistant (e.g. one per 1 kb; not all bins need be present). The bin width is determined automatically by majority vote. A sketch of the matching logic is given below.
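
To make the matching rule concrete, here is a minimal sketch (not the actual `pre-filter` source; the helper names `bin_width_by_majority` and `keep_locus` are made up for illustration), assuming constant-width bins labelled by their inclusive right end:

```cpp
#include <map>
#include <set>
#include <vector>
#include <cstddef>

// Illustrative helper: infer the constant bin width from the right-end coordinates
// in match-to by majority vote over consecutive coordinate differences.
unsigned int bin_width_by_majority(const std::vector<unsigned int>& match_to_ends){
  std::map<unsigned int,int> votes;
  for (std::size_t i = 1; i < match_to_ends.size(); i++){
    votes[match_to_ends[i] - match_to_ends[i-1]]++;
  }
  unsigned int width = 0;
  int best = 0;
  for (std::map<unsigned int,int>::const_iterator it = votes.begin(); it != votes.end(); ++it){
    if (it->second > best){ best = it->second; width = it->first; }
  }
  return width;
}

// A locus from pick-from is kept if the bin it falls into is present in match-to:
// with constant width w and right-end-inclusive bins, its bin end is the locus
// rounded up to the next multiple of w (e.g. locus 1500 with w=1000 -> bin end 2000).
bool keep_locus(unsigned int locus, unsigned int width,
                const std::set<unsigned int>& match_to_ends){
  unsigned int bin_end = ((locus + width - 1) / width) * width;
  return match_to_ends.count(bin_end) > 0;
}
```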

## Parameter options

* `--window-size [int:100]` Set the window scale for smoothing (centered, +-size).

* `--remove-outlier [double:3.0]` Set the outlier threshold.

All loci are removed where the observed read depth is further than this value away from the local window average (in units of sqrt(window average), assuming Poisson-distributed read depths). If set to `0.0`, this filter is not applied.

* `--remove-variable [double:2.0]` Set the variability threshold.

All loci are removed where the local window variability exceeds this multiple of the global variability. The global (local) variability is defined as the median (mean) of the absolute distance of the observed read depths from the global median read depth. If set to `0.0`, this filter is not applied. Both filters are sketched below.
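
For reference, a minimal sketch of how these two filters act on the read depths of a single sample along one chromosome (an illustration under the definitions above, not the actual `pre-filter` implementation; `prefilter_keep` and `median_of` are made-up names, and edge handling may differ in detail):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Median of a non-empty vector (the copy is sorted).
static double median_of(std::vector<double> v){
  std::sort(v.begin(), v.end());
  std::size_t n = v.size();
  return (n % 2 == 1) ? v[n/2] : 0.5 * (v[n/2 - 1] + v[n/2]);
}

// depth:   observed read depths in coordinate order
// window:  half-width of the centered smoothing window          (cf. --window-size)
// out_thr: outlier threshold in units of sqrt(window average)   (cf. --remove-outlier)
// var_thr: allowed multiple of the global variability           (cf. --remove-variable)
std::vector<bool> prefilter_keep(const std::vector<double>& depth, int window,
                                 double out_thr, double var_thr){
  int n = (int) depth.size();
  std::vector<bool> keep(n, true);
  // global variability: median absolute distance to the global median read depth
  double gmed = median_of(depth);
  std::vector<double> absdev(n);
  for (int i = 0; i < n; i++) absdev[i] = std::fabs(depth[i] - gmed);
  double gvar = median_of(absdev);
  for (int i = 0; i < n; i++){
    int lo = std::max(0, i - window), hi = std::min(n - 1, i + window);
    double avg = 0.0, lvar = 0.0;
    for (int j = lo; j <= hi; j++){
      avg  += depth[j];
      lvar += std::fabs(depth[j] - gmed); // local variability: mean abs distance to global median
    }
    avg  /= double(hi - lo + 1); // local window average
    lvar /= double(hi - lo + 1);
    // --remove-outlier: depth too far from the local window average, in units of sqrt(average)
    if (out_thr > 0.0 && std::fabs(depth[i] - avg) > out_thr * std::sqrt(avg)) keep[i] = false;
    // --remove-variable: local variability exceeds var_thr times the global variability
    if (var_thr > 0.0 && lvar > var_thr * gvar) keep[i] = false;
  }
  return keep;
}
```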
5 changes: 4 additions & 1 deletion src/clone-fwd-bwd.cpp
@@ -596,7 +596,8 @@ void Clone::get_cna_gof(gsl_vector * post, int s, int evt){
for (int t=0; t<nTimes; t++){
n = cnaEmit->reads[t][s][idx];
N = cnaEmit->depths[t][s][idx];
if (N==0) continue;
xobs = double(n) / double(N);
g = 0;
for (int l=0; l<nLevels; l++){
x = tcn[cnaChr][t][l] * b * mass->data[t];
@@ -623,6 +624,7 @@ void Clone::get_baf_gof(gsl_vector * post, int s, int evt){
for (int t=0; t<nTimes; t++){
n = bafEmit->reads[t][s][idx];
N = bafEmit->depths[t][s][idx];
if (N==0) continue;
xobs = double(n)/double(N);
xobs = min(xobs,1.0-xobs);
g = 0;
@@ -651,6 +653,7 @@ void Clone::get_snv_gof(gsl_vector * post, int s, int evt){
for (int t=0; t<nTimes; t++){
n = snvEmit->reads[t][s][idx];
N = snvEmit->depths[t][s][idx];
if (N==0) continue;
xobs = double(n)/double(N);
mntcn
= (snvEmit->mean_tcn == NULL)
57 changes: 42 additions & 15 deletions src/cloneHD-functions.cpp
@@ -691,6 +691,10 @@ void get_jump_probability( Clone * myClone, cmdl_opts& opts){
}
}
// allocations, now that all events are fixed...
if (cnaEmit->is_set) cnaEmit->get_nObs();
if (bafEmit->is_set) bafEmit->get_nObs();
if (snvEmit->is_set) snvEmit->get_nObs();
//
if (cnaEmit->is_set){
cnaEmit->allocate_mean_tcn();
if (bafEmit->is_set){
@@ -916,9 +920,14 @@ void print_all_results( Clone * myClone, cmdl_opts& opts){
// CNA llhs/gofs per sample...
fprintf(clonal_fp, "# cna-gof\n");
for (int t=0; t<nT; t++){
unsigned int total_nObs = 0;
for (int s=0; s<cnaEmit->nSamples; s++){
for (int evt=0; evt<cnaEmit->nEvents[s]; evt++){
total_nObs += cnaEmit->nObs[t][s][evt];
}
}
if (total_nObs==0 && myClone->cna_gofs[t] > 0) abort();
fprintf( clonal_fp, "%.4f\n", myClone->cna_gofs[t] / double(total_nObs));
}
if (cnaEmit->coarse_grained) print_gof( myClone, cnaEmit, opts);
//
Expand Down Expand Up @@ -953,9 +962,14 @@ void print_all_results( Clone * myClone, cmdl_opts& opts){
//llh and gof per sample
fprintf(clonal_fp, "# baf-gof\n");
for (int t=0; t<nT; t++){
unsigned int total_nObs = 0;
for (int s=0; s<bafEmit->nSamples; s++){
for (int evt=0; evt<bafEmit->nEvents[s]; evt++){
total_nObs += bafEmit->nObs[t][s][evt];
}
}
if (total_nObs==0 && myClone->baf_gofs[t] > 0) abort();
fprintf(clonal_fp, "%.4f\n", myClone->baf_gofs[t] / double(total_nObs));
}
if (bafEmit->coarse_grained) print_gof( myClone, bafEmit, opts);
}
@@ -991,9 +1005,14 @@ void print_all_results( Clone * myClone, cmdl_opts& opts){
delete [] myClone->alpha_snv;
fprintf(clonal_fp, "# snv-gof\n");
for (int t=0; t<nT; t++){
unsigned int total_nObs = 0;
for (int s=0; s<snvEmit->nSamples; s++){
for (int evt=0; evt<snvEmit->nEvents[s]; evt++){
total_nObs += snvEmit->nObs[t][s][evt];
}
}
if (total_nObs==0 && myClone->snv_gofs[t] > 0) abort();
fprintf( clonal_fp, "%.4f\n", myClone->snv_gofs[t] / double(total_nObs));
}
if (snvEmit->coarse_grained) print_gof( myClone, snvEmit, opts);
}
@@ -1039,15 +1058,21 @@ void print_all_results( Clone * myClone, cmdl_opts& opts){
delete [] myClone->alpha_snv;
fprintf(clonal_fp, "# snv-gof\n");
for (int t=0; t<nT; t++){
unsigned int total_nObs = 0;
for (int s=0; s<snvEmit->nSamples; s++){
for (int evt=0; evt<snvEmit->nEvents[s]; evt++){
total_nObs += snvEmit->nObs[t][s][evt];
}
}
if (total_nObs==0 && myClone->snv_gofs[t] > 0) abort();
fprintf( clonal_fp, "%.4f\n", myClone->snv_gofs[t] / double(total_nObs));
}
if (snvEmit->coarse_grained) print_gof( myClone, snvEmit, opts);
}
//
fprintf(clonal_fp, "# cna-ent baf-ent snv-ent\n");
fprintf(clonal_fp, "%.4e %.4e %.4e\n", myClone->cna_total_ent, myClone->baf_total_ent, myClone->snv_total_ent);
fprintf( clonal_fp, "# cna-ent baf-ent snv-ent\n");
fprintf( clonal_fp, "%.4e %.4e %.4e\n",
myClone->cna_total_ent, myClone->baf_total_ent, myClone->snv_total_ent);
fclose(clonal_fp);
if (snv_utcn_fp != NULL) fclose(snv_utcn_fp);
if (baf_utcn_fp != NULL) fclose(baf_utcn_fp);
@@ -1316,7 +1341,9 @@ void print_gof( Clone * myClone, Emission * myEmit, cmdl_opts& opts){
myEmit->chr[s], myEmit->loci[s][first], last-first+1, myEmit->loci[s][last]
);
for (int t=0; t< myClone->nTimes; t++){
fprintf( gof_fp, " %12.2f", gofs[t][s][evt]);
fprintf( gof_fp, " %.4f",
gofs[t][s][evt] > 0 ? gofs[t][s][evt] / double(myEmit->nObs[t][s][evt]) : 0.0
);
}
fprintf( gof_fp,"\n");
}
38 changes: 37 additions & 1 deletion src/emission.cpp
@@ -41,6 +41,7 @@ Emission::Emission(){
coarse_grained = 0;
idx_to_Event_mapped=0;
connect=0;
nObs = NULL;
}


@@ -259,7 +260,42 @@ void Emission::allocate_av_cn(int maxtcn){//only once!
}
}


void Emission::get_nObs(){
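// count, for each time point, sample and event, the number of loci with nonzero
// sequencing depth (these counts are later used to normalize the goodness-of-fit)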
if (nObs!=NULL) abort();
nObs = new unsigned int ** [nTimes];
for (int t=0; t<nTimes; t++){
nObs[t] = new unsigned int * [nSamples];
for (int s=0; s<nSamples; s++){
if (nEvents[s] == 0){
nObs[t][s] = NULL;
continue;
}
nObs[t][s] = new unsigned int [nEvents[s]];
}
}
for (int s=0; s<nSamples; s++){
for (int evt=0; evt<nEvents[s]; evt++){
int first = idx_of_event[s][evt];
int last
= (evt < nEvents[s]-1)
? idx_of_event[s][evt+1] - 1
: nSites[s] - 1;
unsigned int n,N;
for (int t=0; t<nTimes; t++){
nObs[t][s][evt] = 0;
for (int idx=first; idx<=last; idx++){
n = reads[t][s][idx];
N = depths[t][s][idx];
if (N==0){
if (n>0) abort();
continue;
}
nObs[t][s][evt]++;
}
}
}
}
}


//Destructor
2 changes: 2 additions & 0 deletions src/emission.h
@@ -74,6 +74,8 @@ class Emission{
unsigned int ** loci;
unsigned int ** mask;
unsigned int ** dist;
unsigned int *** nObs;
void get_nObs();
int * nSites;
int * chr;
std::set<int> chrs;
2 changes: 1 addition & 1 deletion src/filterHD.cpp
@@ -203,7 +203,7 @@ int main (int argc, const char * argv[]){
jumps[s][l] *= exp(myJD.pnojump[s][l]);
}
//goodness of fit...
if ((mask==NULL || mask[s][l] == 1) && myEmit.depths[t][s][l] > 0){
xobs = double(myEmit.reads[t][s][l]) / double(myEmit.depths[t][s][l]);
if (opts.reflect) xobs = min(xobs,1.0-xobs);
double x=0,g=0,dg=0;