Merge branch 'develop'
andrej-fischer committed Apr 24, 2014
2 parents 2ed799c + 1900b37 commit aabcd3a
Showing 9 changed files with 192 additions and 111 deletions.
42 changes: 25 additions & 17 deletions README.md
@@ -24,6 +24,14 @@ in the `src` directory. The executables will be in `build`. For debugging with g

To report bugs, use the [issue](https://github.com/andrej-fischer/cloneHD/issues) interface of github.

# Full documentation

The full documentation can be found in the `/docs/` subfolder. Click below.

* [pre-filter](/docs/README-pre-filter.md)
* [filterHD](/docs/README-filterHD.md)
* [cloneHD](/docs/README-cloneHD.md)

# What are cloneHD and filterHD for?

cloneHD is software for reconstructing the subclonal structure of a
@@ -59,44 +67,44 @@ prediction (red).
(vii) The observed SNV frequencies, corrected for local ploidy, and per genotype (SNVs are assigned randomly according to the cloneHD SNV posterior).
(All plots are created with Wolfram [Mathematica](http://www.wolfram.com/mathematica/).)

# Tips and tricks

* Note: all input files are assumed to be sorted by genomic coordinate. With Unix, this
can be guaranteed with `sort -k1n,1 -k2n,2 file.txt > sorted-file.txt`.

* Pre-filtering of data can be very important. If filterHD predicts
many more jumps than you would expect, it might be necessary to
pre-filter the data, removing variable regions, outliers or very short
segments (use the programs `pre-filter` and `filterHD`).

* Make sure that the bias field for the tumor CNA data is
meaningful. If a matched normal sample was sequenced with the same
pipeline, its read depth profile, as predicted by filterHD, can be used as a
bias field for the tumor CNA data. Follow the logic of the example
given here.

* If the matched-normal sample was sequenced at lower coverage than the tumor sample,
it might be necessary to run filterHD with a higher-than-optimal diffusion constant
(set with `--sigma [double]`) to obtain a more faithful bias field. Otherwise, the
filterHD solution is too stiff and you lose bias detail.

* filterHD can sometimes run into local optima. In this case, it might be useful to
set initial values for the parameters via `--jumpi [double]` etc.

* By default, cloneHD runs with mass gauging enabled. This seems wasteful,
but is actually quite useful because you can see some alternative explanations
during the course of the analysis.

* Don't put too much weight on the BIC criterion. It was calibrated
using simulated data. For real data, it should be supplemented with
common sense and biological knowledge. Use `--force [int]` to fix the
number of subclones and `--max-tcn [int]` to set the maximum possible total
copy number.

* If high copy numbers are expected only in a few chromosomes, you can increase performance
by using the `--max-tcn [file]` option to specify per-chromosome upper limits.

* For exome sequencing data, the read depth bias can be enormous. The filterHD estimate
of the bias field might not be very useful, especially in segmenting the tumor data.
If available, use instead the jumps seen in the BAF data for both CNA and BAF
(give the BAF jumps file to both `--cna-jumps` and `--baf-jumps`).
6 changes: 6 additions & 0 deletions changelog.md
@@ -1,5 +1,11 @@
# changelog for cloneHD/filterHD

## v1.17.6 / to come

* fixed NaN bug in GOF when N==0 (missing data)
* fixed bugs in `pre-filter` when `--window-size` is greater than the data length
* fixed bug in `pre-filter` in pick-from-match-to mode

## v1.17.5 / 22.04.2014

* fixed memory alloc bug in pre-filter
29 changes: 17 additions & 12 deletions docs/README-pre-filter.md
@@ -4,34 +4,39 @@ The program `pre-filter` can be used to remove loci based on the observed read d

## Typical usage options

* `--data [file]` Input data to be pre-filtered.

The file format is the same as for cloneHD's `--cna` option. Only the first sample will be used for pre-filtering.

* `--pre [string:"./out"]` Set prefix for all output files.

The pre-filtered loci and data are printed to a file named `pre.pref.txt`.

* `--print-tracks [0/1:0]` Print the window average and window variability.

The windowed tracks are used for pre-filtering. They are printed for all loci to a file named `pre.track.txt`. Use this to inspect and tune the pre-filter thresholds.

* `--pick-from [file]` Pre-filter data in this file by picking loci present in `match-to`.

Only loci which fall into a bin also present in `match-to` are selected. Bins in `match-to` are assumed to be of constant width, with the given coordinate being the right bin end (inclusive), e.g.

1000 = 1-1000
2000 = 1001-2000
4000 = 3001-4000
etc.

* `--match-to [file]` Use this file as reference to pick loci in `pick-from`.

Loci in this file are assumed to be equidistant (e.g. one per 1 kb; not all bins need be present). The bin width is determined automatically by majority vote. A sketch of the matching logic is given below.
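
To make the matching rule concrete, here is a minimal sketch (not the actual `pre-filter` source; the helper names `bin_width_by_majority` and `keep_locus` are made up for illustration), assuming constant-width bins labelled by their inclusive right end:

```cpp
#include <map>
#include <set>
#include <vector>
#include <cstddef>

// Illustrative helper: infer the constant bin width from the right-end coordinates
// in match-to by majority vote over consecutive coordinate differences.
unsigned int bin_width_by_majority(const std::vector<unsigned int>& match_to_ends){
  std::map<unsigned int,int> votes;
  for (std::size_t i = 1; i < match_to_ends.size(); i++){
    votes[match_to_ends[i] - match_to_ends[i-1]]++;
  }
  unsigned int width = 0;
  int best = 0;
  for (std::map<unsigned int,int>::const_iterator it = votes.begin(); it != votes.end(); ++it){
    if (it->second > best){ best = it->second; width = it->first; }
  }
  return width;
}

// A locus from pick-from is kept if the bin it falls into is present in match-to:
// with constant width w and right-end-inclusive bins, its bin end is the locus
// rounded up to the next multiple of w (e.g. locus 1500 with w=1000 -> bin end 2000).
bool keep_locus(unsigned int locus, unsigned int width,
                const std::set<unsigned int>& match_to_ends){
  unsigned int bin_end = ((locus + width - 1) / width) * width;
  return match_to_ends.count(bin_end) > 0;
}
```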

## Parameter options

* `--window-size [int:100]` Set the window scale for smoothing (centered, +-size).

* `--remove-outlier [double:3.0]` Set the outlier threshold.

All loci are removed where the observed read depth is further than this value away from the local window average (in units of sqrt(window average), assuming Poisson-distributed read depths). If set to `0.0`, this filter is not applied.

* `--remove-variable [double:2.0]` Set the variability threshold.

All loci are removed where the local window variability exceeds this multiple of the global variability. The global (local) variability is defined as the median (mean) of the absolute distance of the observed read depths from the global median read depth. If set to `0.0`, this filter is not applied. Both filters are sketched below.
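
For reference, a minimal sketch of how these two filters act on the read depths of a single sample along one chromosome (an illustration under the definitions above, not the actual `pre-filter` implementation; `prefilter_keep` and `median_of` are made-up names, and edge handling may differ in detail):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Median of a non-empty vector (the copy is sorted).
static double median_of(std::vector<double> v){
  std::sort(v.begin(), v.end());
  std::size_t n = v.size();
  return (n % 2 == 1) ? v[n/2] : 0.5 * (v[n/2 - 1] + v[n/2]);
}

// depth:   observed read depths in coordinate order
// window:  half-width of the centered smoothing window          (cf. --window-size)
// out_thr: outlier threshold in units of sqrt(window average)   (cf. --remove-outlier)
// var_thr: allowed multiple of the global variability           (cf. --remove-variable)
std::vector<bool> prefilter_keep(const std::vector<double>& depth, int window,
                                 double out_thr, double var_thr){
  int n = (int) depth.size();
  std::vector<bool> keep(n, true);
  // global variability: median absolute distance to the global median read depth
  double gmed = median_of(depth);
  std::vector<double> absdev(n);
  for (int i = 0; i < n; i++) absdev[i] = std::fabs(depth[i] - gmed);
  double gvar = median_of(absdev);
  for (int i = 0; i < n; i++){
    int lo = std::max(0, i - window), hi = std::min(n - 1, i + window);
    double avg = 0.0, lvar = 0.0;
    for (int j = lo; j <= hi; j++){
      avg  += depth[j];
      lvar += std::fabs(depth[j] - gmed); // local variability: mean abs distance to global median
    }
    avg  /= double(hi - lo + 1); // local window average
    lvar /= double(hi - lo + 1);
    // --remove-outlier: depth too far from the local window average, in units of sqrt(average)
    if (out_thr > 0.0 && std::fabs(depth[i] - avg) > out_thr * std::sqrt(avg)) keep[i] = false;
    // --remove-variable: local variability exceeds var_thr times the global variability
    if (var_thr > 0.0 && lvar > var_thr * gvar) keep[i] = false;
  }
  return keep;
}
```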
5 changes: 4 additions & 1 deletion src/clone-fwd-bwd.cpp
@@ -596,7 +596,8 @@ void Clone::get_cna_gof(gsl_vector * post, int s, int evt){
for (int t=0; t<nTimes; t++){
n = cnaEmit->reads[t][s][idx];
N = cnaEmit->depths[t][s][idx];
if (N==0) continue;
xobs = double(n) / double(N);
g = 0;
for (int l=0; l<nLevels; l++){
x = tcn[cnaChr][t][l] * b * mass->data[t];
@@ -623,6 +624,7 @@ void Clone::get_baf_gof(gsl_vector * post, int s, int evt){
for (int t=0; t<nTimes; t++){
n = bafEmit->reads[t][s][idx];
N = bafEmit->depths[t][s][idx];
if (N==0) continue;
xobs = double(n)/double(N);
xobs = min(xobs,1.0-xobs);
g = 0;
@@ -651,6 +653,7 @@ void Clone::get_snv_gof(gsl_vector * post, int s, int evt){
for (int t=0; t<nTimes; t++){
n = snvEmit->reads[t][s][idx];
N = snvEmit->depths[t][s][idx];
if (N==0) continue;
xobs = double(n)/double(N);
mntcn
= (snvEmit->mean_tcn == NULL)
57 changes: 42 additions & 15 deletions src/cloneHD-functions.cpp
@@ -691,6 +691,10 @@ void get_jump_probability( Clone * myClone, cmdl_opts& opts){
}
}
// allocations, now that all events are fixed...
if (cnaEmit->is_set) cnaEmit->get_nObs();
if (bafEmit->is_set) bafEmit->get_nObs();
if (snvEmit->is_set) snvEmit->get_nObs();
//
if (cnaEmit->is_set){
cnaEmit->allocate_mean_tcn();
if (bafEmit->is_set){
@@ -916,9 +920,14 @@ void print_all_results( Clone * myClone, cmdl_opts& opts){
// CNA llhs/gofs per sample...
fprintf(clonal_fp, "# cna-gof\n");
for (int t=0; t<nT; t++){
unsigned int total_nObs = 0;
for (int s=0; s<cnaEmit->nSamples; s++){
for (int evt=0; evt<cnaEmit->nEvents[s]; evt++){
total_nObs += cnaEmit->nObs[t][s][evt];
}
}
if (total_nObs==0 && myClone->cna_gofs[t] > 0) abort();
fprintf( clonal_fp, "%.4f\n", myClone->cna_gofs[t] / double(total_nObs));
}
if (cnaEmit->coarse_grained) print_gof( myClone, cnaEmit, opts);
//
Expand Down Expand Up @@ -953,9 +962,14 @@ void print_all_results( Clone * myClone, cmdl_opts& opts){
//llh and gof per sample
fprintf(clonal_fp, "# baf-gof\n");
for (int t=0; t<nT; t++){
unsigned int total_nObs = 0;
for (int s=0; s<bafEmit->nSamples; s++){
for (int evt=0; evt<bafEmit->nEvents[s]; evt++){
total_nObs += bafEmit->nObs[t][s][evt];
}
}
if (total_nObs==0 && myClone->baf_gofs[t] > 0) abort();
fprintf(clonal_fp, "%.4f\n", myClone->baf_gofs[t] / double(total_nObs));
}
if (bafEmit->coarse_grained) print_gof( myClone, bafEmit, opts);
}
@@ -991,9 +1005,14 @@ void print_all_results( Clone * myClone, cmdl_opts& opts){
delete [] myClone->alpha_snv;
fprintf(clonal_fp, "# snv-gof\n");
for (int t=0; t<nT; t++){
unsigned int total_nObs = 0;
for (int s=0; s<snvEmit->nSamples; s++){
for (int evt=0; evt<snvEmit->nEvents[s]; evt++){
total_nObs += snvEmit->nObs[t][s][evt];
}
}
if (total_nObs==0 && myClone->snv_gofs[t] > 0) abort();
fprintf( clonal_fp, "%.4f\n", myClone->snv_gofs[t] / double(total_nObs));
}
if (snvEmit->coarse_grained) print_gof( myClone, snvEmit, opts);
}
@@ -1039,15 +1058,21 @@ void print_all_results( Clone * myClone, cmdl_opts& opts){
delete [] myClone->alpha_snv;
fprintf(clonal_fp, "# snv-gof\n");
for (int t=0; t<nT; t++){
unsigned int total_nObs = 0;
for (int s=0; s<snvEmit->nSamples; s++){
for (int evt=0; evt<snvEmit->nEvents[s]; evt++){
total_nObs += snvEmit->nObs[t][s][evt];
}
}
if (total_nObs==0 && myClone->snv_gofs[t] > 0) abort();
fprintf( clonal_fp, "%.4f\n", myClone->snv_gofs[t] / double(total_nObs));
}
if (snvEmit->coarse_grained) print_gof( myClone, snvEmit, opts);
}
//
fprintf(clonal_fp, "# cna-ent baf-ent snv-ent\n");
fprintf(clonal_fp, "%.4e %.4e %.4e\n", myClone->cna_total_ent, myClone->baf_total_ent, myClone->snv_total_ent);
fprintf( clonal_fp, "# cna-ent baf-ent snv-ent\n");
fprintf( clonal_fp, "%.4e %.4e %.4e\n",
myClone->cna_total_ent, myClone->baf_total_ent, myClone->snv_total_ent);
fclose(clonal_fp);
if (snv_utcn_fp != NULL) fclose(snv_utcn_fp);
if (baf_utcn_fp != NULL) fclose(baf_utcn_fp);
@@ -1316,7 +1341,9 @@ void print_gof( Clone * myClone, Emission * myEmit, cmdl_opts& opts){
myEmit->chr[s], myEmit->loci[s][first], last-first+1, myEmit->loci[s][last]
);
for (int t=0; t< myClone->nTimes; t++){
fprintf( gof_fp, " %12.2f", gofs[t][s][evt]);
fprintf( gof_fp, " %.4f",
gofs[t][s][evt] > 0 ? gofs[t][s][evt] / double(myEmit->nObs[t][s][evt]) : 0.0
);
}
fprintf( gof_fp,"\n");
}
38 changes: 37 additions & 1 deletion src/emission.cpp
@@ -41,6 +41,7 @@ Emission::Emission(){
coarse_grained = 0;
idx_to_Event_mapped=0;
connect=0;
nObs = NULL;
}


@@ -259,7 +260,42 @@ void Emission::allocate_av_cn(int maxtcn){//only once!
}
}


void Emission::get_nObs(){
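// count, for each time point, sample and event, the number of loci with nonzero
// sequencing depth (these counts are later used to normalize the goodness-of-fit)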
if (nObs!=NULL) abort();
nObs = new unsigned int ** [nTimes];
for (int t=0; t<nTimes; t++){
nObs[t] = new unsigned int * [nSamples];
for (int s=0; s<nSamples; s++){
if (nEvents[s] == 0){
nObs[t][s] = NULL;
continue;
}
nObs[t][s] = new unsigned int [nEvents[s]];
}
}
for (int s=0; s<nSamples; s++){
for (int evt=0; evt<nEvents[s]; evt++){
int first = idx_of_event[s][evt];
int last
= (evt < nEvents[s]-1)
? idx_of_event[s][evt+1] - 1
: nSites[s] - 1;
unsigned int n,N;
for (int t=0; t<nTimes; t++){
nObs[t][s][evt] = 0;
for (int idx=first; idx<=last; idx++){
n = reads[t][s][idx];
N = depths[t][s][idx];
if (N==0){
if (n>0) abort();
continue;
}
nObs[t][s][evt]++;
}
}
}
}
}


//Destructor
2 changes: 2 additions & 0 deletions src/emission.h
@@ -74,6 +74,8 @@ class Emission{
unsigned int ** loci;
unsigned int ** mask;
unsigned int ** dist;
unsigned int *** nObs;
void get_nObs();
int * nSites;
int * chr;
std::set<int> chrs;
2 changes: 1 addition & 1 deletion src/filterHD.cpp
@@ -203,7 +203,7 @@ int main (int argc, const char * argv[]){
jumps[s][l] *= exp(myJD.pnojump[s][l]);
}
//goodness of fit...
if ((mask==NULL || mask[s][l] == 1) && myEmit.depths[t][s][l] > 0){
xobs = double(myEmit.reads[t][s][l]) / double(myEmit.depths[t][s][l]);
if (opts.reflect) xobs = min(xobs,1.0-xobs);
double x=0,g=0,dg=0;