Skip to content

Commit

Permalink
Merge branch 'develop'
Browse files Browse the repository at this point in the history
  • Loading branch information
andrej-fischer committed May 29, 2014
2 parents e751fd5 + c5fedcd commit 8a4d7e8
Show file tree
Hide file tree
Showing 17 changed files with 213 additions and 94 deletions.
7 changes: 7 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -108,3 +108,10 @@ prediction (red).
of the bias field might not be very useful, especially in segmenting the tumor data.
Use rather, if available, the jumps seen in the BAF data for both CNA and BAF data
(give the BAF jumps file to both `--cna-jumps` and `--baf-jumps`).

#How to cite

The cloneHD and filterHD software is free under the GNU General Public License v3.
If you use this software in your work, please cite the accompanying publication:

Andrej Fischer, Ignacio Vazquez-Garcia, Christopher J.R. Illingworth and Ville Mustonen. High-definition reconstruction of subclonal composition in cancer. Cell Reports (2014), http://dx.doi.org/ 10.1016/j.celrep.2014.04.055
9 changes: 5 additions & 4 deletions TODO.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,8 +6,6 @@

## cloneHD

* with `--purity` and `--clones` (mass only), `--nmax` is ignored
* if using BAF-jumps for CNA: mapping to be checked
* print full if `--*jump` is (also) given, only events if `--*jumps`
* check memory leaks

Expand All @@ -17,14 +15,17 @@

* filter loci incompatible with emission model via 2-state HMM
* do Baum-Welch
* `--filter-shortSeg [int]` via posterior jump-prob (not via pmean)!
* `--filter-shortSeg [int]` via posterior jump-prob (not via pmean)
* Use diffusion constant that scales with size in mode 3/4.
* Implement jump subtraction

## cloneHD

* bias field as distribution
* re-think update_snp_fixed (hashing)
* different nLevels per chr via max-tcn (for memory only)
* different nLevels per chr via max-tcn (for memory efficiency only)

## both

* string chromosome ids
* re-factor emission class
15 changes: 14 additions & 1 deletion changelog.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,19 @@
# changelog for cloneHD/filterHD

## v1.17.6 / to come
## v1.17.8 / 29.05.2014

* added checks whether files are open for writing
* changed to new defaults: `--(cna/baf/snv)-rnd [double:1.0e-6]` (nan)
* allowed `--cna-jump -1` and `--baf-jump -1` (no jumps)
* `--cna-jumps [baf-jumps-file]` and vice versa enabled (useful for exome data)
* jumps read and integrated with new function match_jumps() (not get_track()).
* fixed bug when chromosomes have no non-zero observations.

## v1.17.7 / 25.04.2014

* fixed range error in `pre-filter` in pick-from/match-to mode.

## v1.17.6 / 24.04.2014

* fixed nan bug in GOF, when N==0 (missing data).
* fixed bugs in `pre-filter`, when `--window-size` is greater than length
Expand Down
6 changes: 3 additions & 3 deletions docs/README-cloneHD.md
Original file line number Diff line number Diff line change
Expand Up @@ -116,9 +116,9 @@ models are used (Poisson or Binomial).

The rate for indiviual random emissions per data set.

* `--cna-rnd [double:0.0]`
* `--baf-rnd [double:0.0]`
* `--snv-rnd [double:0.0]`
* `--cna-rnd [double:1.0e-6]`
* `--baf-rnd [double:1.0e-6]`
* `--snv-rnd [double:1.0e-6]`

Both can be learned with filterHD for data with persistence.

Expand Down
5 changes: 5 additions & 0 deletions docs/README-pre-filter.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,11 @@

The program `pre-filter` can be used to remove loci based on the observed read depth. It includes two heuristic filtering methods: loci are removed based on (i) their local variability and (ii) their being an outlier (see below).

![pref](/images/prefilter-1.png "Pre-filtering of read depth via matched normal.")

The effect of `pre-filter` on read depth data: (A) Centromeric regions of real chromosomes often show huge large scale variability in their read depth. But there are also many small regions with very low read depth throughout. (B) After pre-filtering, the problematic regions are masked out. (C) Removing the same regions in the tumor data improves quality visibly while retaining biologically relevant features.


## Typical usage options

* `--data [file]` Input data to be pre-filtered.
Expand Down
Binary file added images/filterHD-1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added images/prefilter-1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
1 change: 1 addition & 0 deletions run-example.sh
Original file line number Diff line number Diff line change
Expand Up @@ -68,6 +68,7 @@ if [ -z $part ] || [ $part -eq 2 ]
then
### cloneHD ###
echo "*** cloneHD ***"
echo "True mass and cell fractions:" `cat test/data/clones.txt`
echo
# The CNA and BAF data is analysed for subclonality.
# Try varying the --min-jump, --force and --max-tcn values and try --mass-gauging 0.
Expand Down
2 changes: 2 additions & 0 deletions src/clone-prior.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,8 @@ void Clone::set_cna_prior( gsl_vector * prior, int sample){
//used in SNV-only mode, w/o correlation and w/o cn-info...
void Clone::initialize_snv_prior_param(){// SNV prior, conditional on max-tcn
if (initial_snv_prior_param != NULL) gsl_matrix_free(initial_snv_prior_param);
initial_snv_prior_param = NULL;
if (nClones == 0) return;
initial_snv_prior_param = gsl_matrix_calloc( maxtcn+1, maxtcn+1);
gsl_matrix_set( initial_snv_prior_param, 0, 0, snv_fpr);
double p = snv_pen_high;// penalty for higher genotypes
Expand Down
5 changes: 4 additions & 1 deletion src/clone-update.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -68,7 +68,10 @@ double Clone::update( gsl_vector * prior, gsl_vector * post, Emission * myEmit,
}
gsl_vector_mul( post, prior);
norm = gsl_blas_dasum(post);
if ( norm != norm || norm <= 0.0) abort();
if ( norm != norm || norm <= 0.0){
cout<<"ERROR\n";
abort();
}
gsl_vector_scale( post, 1.0 / norm);
return( log(norm) );
}
Expand Down
111 changes: 87 additions & 24 deletions src/cloneHD-functions.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -119,41 +119,45 @@ void get_track(const char * track_fn,
line_ss.str(line);
line_ss >> chr >> locus;
l = (chr == old_chr) ? l+1 : 0;
if (chr != old_chr && (chr > myEmit->maxchr || myEmit->idx_of[chr] < 0) ){
printf("ERROR: unexpected chromosome.\n");
if (chr != old_chr && myEmit->chrs.count(chr) == 0){
printf("ERROR: unexpected chromosome in %s.\n", track_fn);
cout<<line<<endl;
exit(1);
}
// exit(0);
old_chr = chr;
if( (int) myEmit->loci[ myEmit->idx_of[chr] ][l] != locus){
printf("ERROR in get_track()\n");
int s = myEmit->idx_of[chr];
if( (int) myEmit->loci[s][l] != locus){
printf("ERROR in get_track(): unexpected coordinate in %s\n", track_fn);
cout<<line<<endl;
printf( "%i, %i, %i vs %i\n", myEmit->idx_of[chr], l, myEmit->loci[ myEmit->idx_of[chr] ][l], locus);
printf("Expected:\n%i, %i, %i vs %i\n", s, l, myEmit->loci[s][l], locus);
exit(1);
}
line_ss >> mn;
mean[ myEmit->idx_of[chr] ][l] = mn;
mean[s][l] = mn;
if (var != NULL) {
line_ss >> sd;
var[ myEmit->idx_of[chr] ][l] = pow(sd,2);
var[s][l] = pow(sd,2);
}
if (distribution == NULL) continue;
double p;
line_ss >> jp;
for (int i=0; i < (int) (distribution[ myEmit->idx_of[chr] ])->size2; i++){
for (int i=0; i < (int) (distribution[s])->size2; i++){
if (line_ss.good() != true){
printf("ERROR in get_track()\n");
printf("ERROR in get_track(): distribution grid too small in %s\n",track_fn);
exit(1);
}
line_ss >> p;
gsl_matrix_set( distribution[ myEmit->idx_of[chr] ], l, i, p);
gsl_matrix_set( distribution[s], l, i, p);
}
}
ifs.close();
}





// get the maximum total c.n. per chromosome
void get_maxtcn_input(const char * maxtcn_fn, int maxtcn_gw, Clone * myClone){
myClone->maxtcn_input.clear();
Expand Down Expand Up @@ -241,7 +245,7 @@ void get_maxtcn_input(const char * maxtcn_fn, int maxtcn_gw, Clone * myClone){



//get total copynumber tracks from file...
//get mean total copynumber from file...
void get_mean_tcn( const char * mntcn_fn, Clone * myClone, Emission * myEmit){
ifstream ifs;
string line;
Expand Down Expand Up @@ -605,6 +609,53 @@ void get_fixed_clones(gsl_matrix *& clones, gsl_vector *& mass, const char * clo



void match_jumps(const char * jumps_fn, Emission * myEmit){
ifstream ifs;
string line;
stringstream line_ss;
ifs.open( jumps_fn, ios::in);
if (ifs.fail()){
printf("ERROR file %s cannot be opened.\n", jumps_fn);
exit(1);
}
int chr=0, old_chr = -1;
int in_locus=0, locus=-1;
int s=-1,idx=0;
int wait=0;
double pj=0;
while( ifs.good() ){
line.clear();
getline( ifs, line);
if (line.empty()) break;
if (line[0] == '#') continue;
line_ss.clear();
line_ss.str(line);
line_ss >> chr >> in_locus;
if (chr != old_chr){
wait = (myEmit->chrs.count(chr)) == 0 ? 1 : 0;
idx=1;
}
old_chr = chr;
if (wait) continue;
s = myEmit->idx_of[chr];
if (idx == myEmit->nSites[s]) continue;//chromosome is complete!
locus = (int) myEmit->loci[s][idx];//current target locus
if (in_locus < (int) myEmit->loci[s][0]) continue;
while(in_locus > locus){
idx++;
if (idx == myEmit->nSites[s]) break;
locus = (int) myEmit->loci[s][idx];
}
if (idx == myEmit->nSites[s]) continue;
line_ss >> pj;
myEmit->pjump[s][idx] = 1.0 - (1.0-pj)*(1.0-myEmit->pjump[s][idx]);
}
//done
ifs.close();
}



//***posterior jump probability track***
void get_jump_probability( Clone * myClone, cmdl_opts& opts){
Emission * cnaEmit = myClone->cnaEmit;
Expand All @@ -614,7 +665,8 @@ void get_jump_probability( Clone * myClone, cmdl_opts& opts){
gsl_matrix ** distdummy = NULL;
if (cnaEmit->is_set){//***CNA JUMPS***
if(opts.cna_jumps_fn != NULL){//1. either use external jump probability track
get_track( opts.cna_jumps_fn, distdummy, cnaEmit->pjump, vardummy, cnaEmit);
//get_track( opts.cna_jumps_fn, distdummy, cnaEmit->pjump, vardummy, cnaEmit);
match_jumps(opts.cna_jumps_fn, cnaEmit);
for (int s=0; s< cnaEmit->nSamples; s++){
cnaEmit->coarse_grain_jumps( s, opts.min_jump, 5);
}
Expand All @@ -624,9 +676,6 @@ void get_jump_probability( Clone * myClone, cmdl_opts& opts){
cnaEmit->set_pjump(opts.cna_jump);
cnaEmit->get_events_via_jumps();
}
else{
abort();
}
//map CNA events to BAF and SNV...
if(bafEmit->is_set){
for (int s=0; s<bafEmit->nSamples; s++){
Expand All @@ -635,35 +684,49 @@ void get_jump_probability( Clone * myClone, cmdl_opts& opts){
}
if (snvEmit->is_set){
for (int s=0; s<snvEmit->nSamples; s++){
if ( !bafEmit->is_set ||bafEmit->chrs.count(snvEmit->chr[s]) == 0){
if ( !bafEmit->is_set || bafEmit->chrs.count(snvEmit->chr[s]) == 0){
snvEmit->map_idx_to_Event( cnaEmit, s);
}
}
}
}
if ( bafEmit->is_set ){//***BAF JUMPS***
if (!cnaEmit->is_set) abort();
if (opts.baf_jumps_fn != NULL){//1. either external jump probability track
get_track( opts.baf_jumps_fn, distdummy, bafEmit->pjump, vardummy, bafEmit);
// get_track( opts.baf_jumps_fn, distdummy, bafEmit->pjump, vardummy, bafEmit);
match_jumps( opts.baf_jumps_fn, bafEmit);
if ( opts.cna_jumps_fn != NULL//allow transit at CNA jumps
&& opts.cna_jumps_fn != opts.baf_jumps_fn//but not the same jumps
){
match_jumps( opts.cna_jumps_fn, bafEmit);
}
bafEmit->add_break_points_via_jumps( cnaEmit, opts.min_jump);
for (int s=0; s< bafEmit->nSamples; s++){// ignore improbable jump events
bafEmit->coarse_grain_jumps( s, opts.min_jump, 5);
}
bafEmit->get_events_via_jumps();
if ( cnaEmit->is_set && opts.cna_jumps_fn != NULL ){//allow transit at CNA jumps
bafEmit->add_break_points_via_jumps( cnaEmit, opts.min_jump);
/*
cout<<"BAF-jumps\n";
for (int evt=0;evt<bafEmit->nEvents[0];evt++){
int idx = bafEmit->idx_of_event[0][evt];
printf("%3i %6i %6i\n", evt, idx, bafEmit->loci[0][idx]);
}
bafEmit->get_events_via_jumps();
cout<<"CNA-jumps\n";
for (int evt=0;evt<cnaEmit->nEvents[0];evt++){
int idx = cnaEmit->idx_of_event[0][evt];
printf("%3i %6i %6i\n", evt, idx, cnaEmit->loci[0][idx]);
}
exit(0);
*/
}
else if ( cnaEmit->is_set && opts.cna_jumps_fn != NULL ){//2. map the CNA jumps to BAF
else if (cnaEmit->is_set && opts.cna_jumps_fn != NULL ){//2. map the CNA jumps to BAF
bafEmit->map_jumps(cnaEmit);
bafEmit->get_events_via_jumps();
}
else if (opts.baf_jump >= 0.0){//or 3. constant jump probability per base
bafEmit->set_pjump(opts.baf_jump);
bafEmit->get_events_via_jumps();
}
else{
abort();
}
// map BAF events to SNV
if (snvEmit->is_set){
for (int s=0; s<snvEmit->nSamples; s++){
Expand Down
2 changes: 1 addition & 1 deletion src/cloneHD-functions.h
Original file line number Diff line number Diff line change
Expand Up @@ -80,7 +80,7 @@ void get_baf_data(Emission * bafEmit, cmdl_opts& opts, int& nTimes, int& nT);
void get_snv_data(Emission * snvEmit, cmdl_opts& opts, int& nTimes, int& nT);
void get_snv_bulk_prior( Clone * myClone, cmdl_opts& opts);
void get_track(const char * fn, gsl_matrix **& dist, double **& mn, double **& var, Emission * myEmit);

void match_jumps(const char * jumps_fn, Emission * myEmit);
void get_maxtcn_input(const char * maxtcn_fn, int maxtcn_gw, Clone * myClone);
void get_mean_tcn( const char * mtcn_fn, Clone * myClone, Emission * myEmit);
void get_avail_cn( const char * avcn_fn, Clone * myClone, Emission * myEmit);
Expand Down
8 changes: 6 additions & 2 deletions src/cloneHD-inference.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -89,7 +89,11 @@ int infer_clones( gsl_matrix * Clones, gsl_vector * Mass, Clone * myClone, cmdl_
char clonal_out[1024];
sprintf( clonal_out, "%s.summary.txt", opts.pre);
FILE * clonal_fp = fopen(clonal_out,"w");
fprintf(clonal_fp, "# n cna-llh baf-llh snv-llh total-llh total-bic\n");
if (clonal_fp == NULL){
printf("ERROR: file %s could not be opened for writing.\n", clonal_out);
exit(1);
}
fprintf( clonal_fp, "# n cna-llh baf-llh snv-llh total-llh total-bic\n");
//containers for the best estimates, given n...
gsl_vector ** best_mass = new gsl_vector * [opts.nmax+1];
gsl_matrix ** best_clones = new gsl_matrix * [opts.nmax+1];
Expand Down Expand Up @@ -250,7 +254,7 @@ int infer_clones( gsl_matrix * Clones, gsl_vector * Mass, Clone * myClone, cmdl_
if (best_mass[bestn] != NULL){
myClone->set_mass(best_mass[bestn]);
}
if (!cnaEmit->is_set && snvEmit->is_set && !snvEmit->connect){
if (!cnaEmit->is_set && snvEmit->is_set && !snvEmit->connect && bestn > 0){
myClone->initialize_snv_prior_param();
}
if (best_priors[bestn] != NULL && bestn > 0){
Expand Down
11 changes: 6 additions & 5 deletions src/cloneHD.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -86,8 +86,8 @@ int main (int argc, const char * argv[]){
myClone.cna_pen_diff = opts.cna_pen_diff;//CNA penalty for different c.n.
myClone.cna_pen_norm = opts.cna_pen_norm;//CNA penalty for non-normal c.n.
myClone.baf_pen_comp = opts.baf_pen_comp;//BAF penalty for complex chr status
myClone.snv_pen_high = opts.snv_pen_high;//penalty for high SNV genotypes
myClone.snv_pen_mult = opts.snv_pen_mult;//penalty for multiple hit SNV
myClone.snv_pen_high = opts.snv_pen_high;//SNV penalty for high SNV genotypes
myClone.snv_pen_mult = opts.snv_pen_mult;//SNV penalty for multiple hit SNVs
myClone.snv_fpr = opts.snv_fpr;//SNV false-positive rate
myClone.snv_fpf = opts.snv_fpf;//SNV frequency of false positives
myClone.bulk_fix = opts.bulk_fix;
Expand Down Expand Up @@ -147,6 +147,7 @@ int main (int argc, const char * argv[]){
cout<<"done."<<endl;
}
cout<<endl;
//exit(0);
// get purities...
if (opts.purity_fn != NULL){
get_purity( opts.purity_fn, myClone.min_purity);
Expand Down Expand Up @@ -404,9 +405,9 @@ void default_opts(cmdl_opts& opts){
opts.baf_jump = -1.0;
opts.snv_jump = -1.0;
//random error rates...
opts.cna_rnd = 0.0;
opts.baf_rnd = 0.0;
opts.snv_rnd = 0.0;
opts.cna_rnd = 1.0e-6;
opts.baf_rnd = 1.0e-6;
opts.snv_rnd = 1.0e-6;
opts.snv_fpf = 0.01;
opts.snv_fpr = 0.001;
//shape parameters...
Expand Down
Loading

0 comments on commit 8a4d7e8

Please sign in to comment.